CN117311777A - Automated operation and maintenance platform and method

Info

Publication number: CN117311777A
Application number: CN202311097268.7A
Authority: CN
Inventors: 饶品波; 王鑫; 陈若鹏; 蒋强; 张岩
Original assignee: China Mobile Communications Group Co Ltd; China Mobile Group Jiangsu Co Ltd
Current assignee: China Mobile Communications Group Co Ltd; China Mobile Group Jiangsu Co Ltd
Priority date: 2023-08-28
Filing date: 2023-08-28
Publication date: 2023-12-29

Abstract

The invention discloses an automated operation and maintenance platform and method, belonging to the technical field of operation and maintenance. An operation monitoring module, an operation and maintenance module and a script publishing module are provided in the automated operation and maintenance platform. The operation monitoring module collects operation data of the system, analyzes the operation data to obtain the current operating state of the system, and determines from the current operating state whether the system requires automated operation and maintenance. When it does, the operation and maintenance module matches a corresponding fault handling template from a preset operation and maintenance knowledge base according to the current operating state and sends the fault handling template to the script publishing module. The script publishing module configures the corresponding operation and maintenance script according to the fault handling template, then publishes and loads the script. Operation and maintenance scripts are thereby updated and iterated automatically according to actual needs, which automates operation and maintenance, improves its efficiency, removes the need to update and iterate the script library manually, and greatly reduces operating costs.

Description

Automated operation and maintenance platform and method

Technical field

The present invention relates to the field of operation and maintenance technology, and in particular to an automated operation and maintenance platform and method.

Background

With the rapid development of big data and cloud computing, the scale and complexity of the related IT infrastructure are increasing rapidly, and this infrastructure has become a key guarantee for the smooth running of business in every industry. Operating and maintaining this IT infrastructure in a standardized way, so that it runs stably, efficiently and securely, so that reliable IT services support business operations, and so that agile IT services support digital transformation, has become an important topic.

Existing automated operation and maintenance monitoring platforms take a long time to build, are costly to build, and are themselves complicated to operate and maintain.

Summary of the invention

The main purpose of the present invention is to provide an automated operation and maintenance platform and method, aiming to solve the technical problem that operation and maintenance platforms in the prior art are costly to build and inefficient.

To achieve the above purpose, the present invention provides an automated operation and maintenance platform, which includes: an operation monitoring module, an operation and maintenance module and a script publishing module;

The operation monitoring module is used to collect operation data of the system, analyze the operation data to obtain the current operating state of the system, and determine according to the current operating state whether the system requires automated operation and maintenance;

The operation and maintenance module is used to, when the system requires automated operation and maintenance, match a corresponding fault handling template from a preset operation and maintenance knowledge base according to the current operating state, and send the fault handling template to the script publishing module;

The script publishing module is used to configure the corresponding operation and maintenance script according to the fault handling template, and to publish and load the operation and maintenance script.

Optionally, the script publishing module includes: an automatic script writing and publishing sub-module;

The automatic script writing and publishing sub-module is used to query a preset operation and maintenance script library, according to the fault handling template, for an operation and maintenance script matching the fault type, and to publish and load that script when a matching script is found;

The automatic script writing and publishing sub-module is also used to, when no operation and maintenance script matching the fault type is found, automatically configure and generate an operation and maintenance script according to the fault handling template, and to review, publish and load the generated script.

Optionally, the automatic script writing and publishing sub-module is also used to determine a fault handling method according to the fault handling template, and to determine a corresponding fault handling code structure according to the fault handling method;

The automatic script writing and publishing sub-module is also used to determine corresponding code parameters according to the fault handling code structure, fill the code parameters into the fault handling code structure, and generate the operation and maintenance script according to preset script generation rules.

Optionally, the script publishing module also includes: a script maintenance sub-module and a script update management sub-module;

The script maintenance sub-module is used to monitor, at a preset period, the current state information of the system after the current operation and maintenance script has been executed, and to detect whether the current state information is consistent with a preset system state;

The script maintenance sub-module is also used to generate update information and send it to the script update management sub-module when the current state information is inconsistent with the preset system state;

The script update management sub-module is used to obtain, based on the update information, a state difference from the current state information and the preset system state, and to update the current operation and maintenance script according to the state difference.
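
The periodic consistency check and the state difference described above can be sketched as follows. This is a minimal illustration only: the patent does not specify how states are represented or compared, so the dictionary-of-metrics representation, the `tolerance` parameter and the function names `state_difference` and `maintenance_check` are all assumptions.

```python
def state_difference(current, expected, tolerance=0.0):
    """Return per-metric differences (current - expected) larger than the tolerance.

    Assumption: a system state is a dict mapping metric names to numbers.
    """
    diff = {}
    for name, target in expected.items():
        delta = current.get(name, 0) - target
        if abs(delta) > tolerance:
            diff[name] = delta
    return diff


def maintenance_check(current, expected, tolerance=0.0):
    """Return update information when the states are inconsistent, else None."""
    diff = state_difference(current, expected, tolerance)
    if not diff:
        return None  # states are consistent, the current script is kept
    # Hypothetical update message handed to the script update management sub-module.
    return {"action": "update_script", "state_difference": diff}
```

In this sketch the update management step would read `state_difference` from the returned message to decide how to adjust the script; how that adjustment is made is left open here, as it is in the text.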

Optionally, the script maintenance sub-module is also used to obtain an operation and maintenance script running state monitoring model, and to obtain fault information according to the current operating state;

The script maintenance sub-module is also used to input the fault information and the current operation and maintenance script into the operation and maintenance script running state monitoring model, to obtain the preset system state after the current operation and maintenance script is executed.

Optionally, the script maintenance sub-module is also used to obtain historical operation and maintenance data, and to preprocess the historical operation and maintenance data to obtain corresponding training data;

The script maintenance sub-module is also used to perform training based on the training data to obtain the operation and maintenance script running state monitoring model.

Optionally, the operation and maintenance module is also used to obtain historical operation and maintenance data of the system, and to preprocess that data to obtain an operation and maintenance data text collection;

The operation and maintenance module is also used to perform data association rule mining on the operation and maintenance data text collection, converting the historical operation and maintenance data in the collection into operation and maintenance knowledge data;

The operation and maintenance module is also used to establish the preset operation and maintenance knowledge base according to the operation and maintenance knowledge data.
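
One possible reading of the association rule mining step is sketched below. The patent does not name a mining algorithm, so this uses plain support and confidence counting over (symptoms, handling action) pairs as a stand-in. The record format, the thresholds and the function name `mine_rules` are assumptions made for illustration.

```python
from collections import Counter


def mine_rules(records, min_support=2, min_confidence=0.6):
    """Derive 'symptom -> handling action' rules from historical records.

    Assumption: each record is (list_of_symptom_keywords, handling_action).
    Returns {symptom: (action, confidence)} keeping the most confident action.
    """
    symptom_counts = Counter()
    pair_counts = Counter()
    for symptoms, action in records:
        for symptom in set(symptoms):
            symptom_counts[symptom] += 1
            pair_counts[(symptom, action)] += 1

    rules = {}
    for (symptom, action), support in pair_counts.items():
        confidence = support / symptom_counts[symptom]
        if support < min_support or confidence < min_confidence:
            continue
        best = rules.get(symptom)
        if best is None or confidence > best[1]:
            rules[symptom] = (action, confidence)
    return rules
```

The resulting mapping plays the role of the knowledge data from which a knowledge base could be populated; a production system would more likely use an established algorithm such as Apriori or FP-Growth.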

Optionally, the operation and maintenance module is also used to establish a search engine in the preset operation and maintenance knowledge base, and to store the operation and maintenance knowledge data in the preset operation and maintenance knowledge base;

The operation and maintenance module is also used to obtain alarm information according to the operating state, and to determine the fault type corresponding to the alarm information through semantic analysis;

The operation and maintenance module is also used to match, in the preset operation and maintenance knowledge base, a fault handling template corresponding to the fault type, and to send the fault handling template to the script publishing module.

Optionally, the operation monitoring module is also used to collect historical fault data and historical operating state data of the system;

The operation monitoring module is also used to perform training based on the historical fault data and the historical operating state data to establish an alarm prediction model;

The operation monitoring module is also used to input the operation data into the alarm prediction model to obtain a predicted value of the current operating state of the system;

The operation monitoring module is also used to compare the predicted value of the current operating state with a preset alarm threshold, and to determine that the system requires automated operation and maintenance when the predicted value is greater than the preset alarm threshold;

The operation monitoring module is also used to determine that the system does not require automated operation and maintenance when the predicted value of the current operating state is less than or equal to the preset alarm threshold.
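
The threshold decision above is simple enough to state in code. The sketch below is illustrative only: `predict_state_value` is a hypothetical weighted-sum stand-in for the trained alarm prediction model, whose actual form the patent does not specify, and both function names are assumptions.

```python
def predict_state_value(operation_data, weights):
    """Hypothetical stand-in for the alarm prediction model: a weighted sum of metrics."""
    return sum(weight * operation_data.get(name, 0) for name, weight in weights.items())


def needs_automated_maintenance(predicted_value, alarm_threshold):
    """Automated operation and maintenance is required only when the
    predicted value is strictly greater than the preset alarm threshold."""
    return predicted_value > alarm_threshold
```

Note the strict inequality: a predicted value exactly equal to the threshold does not trigger automated operation and maintenance, matching the "less than or equal to" branch in the text.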

In addition, to achieve the above purpose, the present invention also proposes an automated operation and maintenance method, which is applied to the automated operation and maintenance platform described above and includes the following steps:

The operation monitoring module collects operation data of the system, analyzes the operation data to obtain the current operating state of the system, and determines according to the current operating state whether the system requires automated operation and maintenance;

When the system requires automated operation and maintenance, the operation and maintenance module matches a corresponding fault handling template from the preset operation and maintenance knowledge base according to the current operating state, and sends the fault handling template to the script publishing module;

The script publishing module configures the corresponding operation and maintenance script according to the fault handling template, and publishes and loads the operation and maintenance script.
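
The three method steps can be read as a small pipeline. The sketch below wires them together with caller-supplied functions; it assumes the operating state is reported as the string "normal" or "abnormal", and the name `run_cycle` and its parameters are hypothetical, not taken from the patent.

```python
def run_cycle(operation_data, analyze, match_template, publish_script):
    """Run one monitoring cycle through the three modules.

    analyze:         operation data -> current operating state (monitoring module)
    match_template:  operating state -> fault handling template (O&M module)
    publish_script:  template -> published script (script publishing module)
    """
    state = analyze(operation_data)
    if state == "normal":
        return None  # no automated operation and maintenance required
    template = match_template(state)
    return publish_script(template)
```

Passing the modules in as functions keeps the sketch independent of how each module is actually implemented.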

In the present invention, an operation monitoring module, an operation and maintenance module and a script publishing module are provided in the automated operation and maintenance platform. The operation monitoring module is used to collect operation data of the system, analyze the operation data to obtain the current operating state of the system, and determine according to the current operating state whether the system requires automated operation and maintenance. The operation and maintenance module is used to, when the system requires automated operation and maintenance, match a corresponding fault handling template from the preset operation and maintenance knowledge base according to the current operating state, and send the fault handling template to the script publishing module. The script publishing module is used to configure the corresponding operation and maintenance script according to the fault handling template, and to publish and load the script. Operation and maintenance scripts are thereby upgraded and iterated automatically according to actual needs, which automates operation and maintenance and improves its efficiency; at the same time, the script library no longer needs to be updated and iterated manually, which greatly reduces operating costs.

Description of the drawings

Figure 1 is a schematic structural diagram of the first embodiment of the automated operation and maintenance platform of the present invention;

Figure 2 is a schematic structural diagram of the second embodiment of the automated operation and maintenance platform of the present invention;

Figure 3 is a schematic flow chart of the automatic script writing and publishing sub-module automatically configuring an operation and maintenance script in an embodiment of the automated operation and maintenance platform of the present invention;

Figure 4 is a schematic structural diagram of the third embodiment of the automated operation and maintenance platform of the present invention;

Figure 5 is a schematic diagram of establishing the preset operation and maintenance knowledge base in an embodiment of the automated operation and maintenance platform of the present invention;

Figure 6 is a schematic flow chart of the first embodiment of the automated operation and maintenance method of the present invention;

Figure 7 is a schematic diagram of the overall flow of the automated operation and maintenance method in an embodiment of the present invention.

Explanation of reference numbers:

Reference number    Name
10                  Operation monitoring module
20                  Operation and maintenance module
30                  Script publishing module
301                 Automatic script writing and publishing sub-module
302                 Script maintenance sub-module
303                 Script update management sub-module

The realization of the purpose, the functional features and the advantages of the present invention will be further described with reference to the embodiments and the accompanying drawings.

Detailed description

It should be understood that the specific embodiments described here are only used to explain the present invention and are not intended to limit it.

Referring to Figure 1, Figure 1 is a schematic structural diagram of the first embodiment of the automated operation and maintenance platform.

In this embodiment, the automated operation and maintenance platform includes: an operation monitoring module 10, an operation and maintenance module 20, and a script publishing module 30.

It should be noted that the automated operation and maintenance platform of this embodiment mainly consists of three parts: the operation monitoring module 10, the operation and maintenance module 20, and the script publishing module 30.

In a specific implementation, the operation monitoring module 10 is used to collect operation data of the system, analyze the operation data to obtain the current operating state of the system, and determine according to the current operating state whether the system requires automated operation and maintenance.

The operation monitoring module 10 is mainly used to collect operation data of the platform or system. The operation data includes related operating parameters such as the operating state, business data and service data of the platform or system. The module analyzes the collected operation data to obtain the current operating state of the system. The operating state data of the system may include anomaly monitoring data for each application module in the system and usage data for each application module in the system.

Specifically, the anomaly monitoring data of an application module may be log monitoring data obtained from the ELK real-time log analysis system. The log monitoring data records anomalies of the processor, memory and storage of each application module (or device) in the system; in the log monitoring data, 0 can indicate no anomaly and 1 can indicate an anomaly. Depending on actual needs, the monitored objects are not limited to processor, memory and storage and may also include other components, such as network interfaces and power systems. ELK consists of three open-source tools, Elasticsearch, Logstash and Kibana; the specific method of using ELK to obtain log monitoring data is not described here. In addition, the usage data of an application module may be host monitoring data obtained from the Zabbix distributed monitoring system, such as the processor usage, memory usage and storage usage of the target device. The usage data may also include application monitoring data for applications currently running on the system; for example, tools such as Datadog or New Relic can be used to collect metrics such as memory, thread count, disk I/O and index read/write operations for processes in the Java virtual machines running on the system. The business data and service data of the system generally refer to data related to the business services that the system provides externally; this data can be queried from the business logs or service logs of the system.
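
The 0/1 anomaly flags described above lend themselves to a small aggregation step. The sketch below assumes each log monitoring record has already been parsed into a dictionary with a `module` name and a `flags` mapping; that record layout and the function name `current_operating_state` are assumptions, and no ELK or Zabbix API is used here.

```python
def current_operating_state(log_records):
    """Derive the operating state from parsed log monitoring records.

    Assumption: each record looks like
    {"module": "billing", "flags": {"cpu": 0, "memory": 1, "storage": 0}},
    where 1 marks an anomaly and 0 marks no anomaly.
    Returns (state, abnormal_components_by_module).
    """
    abnormal = {}
    for record in log_records:
        flagged = sorted(name for name, flag in record["flags"].items() if flag == 1)
        if flagged:
            abnormal[record["module"]] = flagged
    state = "abnormal" if abnormal else "normal"
    return state, abnormal
```

Other monitored components, such as network interfaces or power systems, would simply appear as additional keys in `flags`.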

The current operating state of the system may be, for example, a normal operating state or an abnormal operating state.

After the operation monitoring module 10 obtains the current operating state of the system, it can determine from that state whether the system requires automated operation and maintenance. It does so by performing risk prediction and business alarming on the current operating state, for example checking whether there is an emergency or a business fault.

For example, if the current operating state of the system is a normal operating state, the system does not require automated operation and maintenance; if the current operating state is an abnormal operating state, the system requires automated operation and maintenance.

In this embodiment, the operation and maintenance module 20 is used to, when the system requires automated operation and maintenance, match a corresponding fault handling template from the preset operation and maintenance knowledge base according to the current operating state, and send the fault handling template to the script publishing module 30.

It should be understood that when the system requires automated operation and maintenance, its current operating state is an abnormal operating state. The operation monitoring module 10 can output data such as alarm information of the system and send the alarm information to the operation and maintenance module 20. The operation and maintenance module 20 is mainly used to match the corresponding fault handling template in the preset operation and maintenance knowledge base according to the alarm information output by the operation monitoring module 10, and then send the fault handling template to the script publishing module 30.

The preset operation and maintenance knowledge base can be established in advance. It may be an operation and maintenance database built with a machine learning algorithm: state mining and analysis is performed on the collected historical operation and maintenance data, and the database is constructed from the analysis results.

The preset operation and maintenance knowledge base stores multiple fault handling templates. When a fault handling template needs to be determined, the corresponding template can be matched from among them according to the alarm information in the current operating state.

Further, after the alarm information in the current operating state output by the operation monitoring module 10 is received, the fault type corresponding to the alarm information can be determined through semantic analysis, so that the corresponding fault handling template can be retrieved from the preset operation and maintenance knowledge base and sent to the script publishing module 30.
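
The alarm-to-template lookup can be sketched as follows. The patent calls for semantic analysis but does not say how it is done, so this example substitutes simple keyword matching as a placeholder. The keyword table, the knowledge base contents and the function names are all hypothetical.

```python
# Hypothetical keyword table standing in for semantic analysis of alarm text.
FAULT_KEYWORDS = {
    "disk_full": ("no space", "disk usage"),
    "service_down": ("connection refused", "timed out"),
}

# Hypothetical knowledge base: one fault handling template per fault type.
KNOWLEDGE_BASE = {
    "disk_full": {"fault_type": "disk_full", "methods": ["clean_logs"]},
    "service_down": {"fault_type": "service_down", "methods": ["restart_service"]},
}


def classify_alarm(alarm_text):
    """Return the first fault type whose keywords appear in the alarm text, else None."""
    text = alarm_text.lower()
    for fault_type, keywords in FAULT_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return fault_type
    return None


def match_template(alarm_text, knowledge_base=KNOWLEDGE_BASE):
    """Look up the fault handling template for an alarm, or None if nothing matches."""
    fault_type = classify_alarm(alarm_text)
    if fault_type is None:
        return None
    return knowledge_base.get(fault_type)
```

A real deployment would replace `classify_alarm` with an actual semantic model and back `KNOWLEDGE_BASE` with the search engine the text mentions.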

The script publishing module 30 is used to configure the corresponding operation and maintenance script according to the fault handling template, and to publish and load the operation and maintenance script.

It should be understood that, before configuring an operation and maintenance script, the script publishing module 30 also searches the existing preset operation and maintenance script library for a script matching the current fault type. When a matching script is found, that script is executed to handle the fault.

In this embodiment, the operation monitoring module 10, the operation and maintenance module 20 and the script publishing module 30 are provided in the automated operation and maintenance platform. The operation monitoring module 10 collects operation data of the system, analyzes the operation data to obtain the current operating state of the system, and determines according to the current operating state whether the system requires automated operation and maintenance. When it does, the operation and maintenance module 20 matches a corresponding fault handling template from the preset operation and maintenance knowledge base according to the current operating state and sends the fault handling template to the script publishing module 30. The script publishing module 30 configures the corresponding operation and maintenance script according to the fault handling template, and publishes and loads the script. Operation and maintenance scripts are thereby upgraded and iterated automatically according to actual needs, which automates operation and maintenance and improves its efficiency; at the same time, the script library no longer needs to be updated and iterated manually, which greatly reduces operating costs.

Referring to Figure 2, Figure 2 is a schematic structural diagram of the second embodiment of the automated operation and maintenance platform of the present invention.

Based on the first embodiment above, the script publishing module 30 in this embodiment includes an automatic script writing and publishing sub-module 301. The automatic script writing and publishing sub-module 301 is used to query the preset operation and maintenance script library, according to the fault handling template, for an operation and maintenance script matching the fault type, and to publish and load that script when a matching script is found.

It should be noted that the preset operation and maintenance script library is established in advance and stores multiple operation and maintenance scripts for handling faults.

After obtaining the fault handling template, the automatic script writing and publishing sub-module 301 searches the preset operation and maintenance script library, according to the fault handling template, for an operation and maintenance script matching the fault type, and thereby determines whether a matching script exists.

In a specific implementation, when the preset operation and maintenance script library is found to contain a script matching the fault type, that script can be published and loaded directly.

In a specific implementation, the automatic script writing and publishing sub-module 301 is also used to, when no operation and maintenance script matching the fault type is found, automatically configure and generate an operation and maintenance script according to the fault handling template, and to review, publish and load the generated script.

It should be noted that when the automatic script writing and publishing sub-module 301 finds no operation and maintenance script matching the fault type, it automatically configures and generates an operation and maintenance script according to the fault handling template. After the generated script passes review, the sub-module publishes and loads the script and executes it to handle the fault.
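
The reuse-or-generate flow with its review gate can be sketched as below. The script library is modeled as a dictionary keyed by fault type, and `generate` and `review` are caller-supplied callables; these choices, the name `publish_script`, and the step that stores a newly approved script back into the library are assumptions for illustration.

```python
def publish_script(template, script_library, generate, review):
    """Reuse a matching script if one exists; otherwise generate, review and publish.

    Returns (script, outcome) where outcome is "reused", "generated" or "rejected".
    """
    fault_type = template["fault_type"]
    script = script_library.get(fault_type)
    if script is not None:
        return script, "reused"  # matching script found: publish it directly

    script = generate(template)  # no match: configure a new script from the template
    if not review(script):
        return None, "rejected"  # scripts that fail review are never published

    # Assumption: approved scripts are added to the library for later reuse.
    script_library[fault_type] = script
    return script, "generated"
```

Keeping the review step as a separate callable leaves room for either an automated check or a human approval step, since the text does not say which is intended.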

自动脚本编写发布子模块301自动配置运维脚本可根据故障处理模板确定对应的代码，从而进行运维脚本自动配置，则所述自动脚本编写发布子模块301，还用于根据所述故障处理模板确定故障处理方法，并根据所述故障处理方法确定对应的故障处理代码结构；所述自动脚本编写发布子模块301，还用于根据所述故障处理代码结构确定对应的代码参数，并将所述代码参数填入所述故障处理代码结构中，按照预设的脚本生成规则生成运维脚本。The automatic script writing and publishing sub-module 301 automatically configures the operation and maintenance script to determine the corresponding code according to the fault processing template, thereby automatically configuring the operation and maintenance script. The automatic script writing and publishing sub-module 301 is also used to configure the operation and maintenance script according to the fault processing template. Determine the fault handling method, and determine the corresponding fault handling code structure according to the fault handling method; the automatic script writing and publishing sub-module 301 is also used to determine the corresponding code parameters according to the fault handling code structure, and write the The code parameters are filled in the fault handling code structure, and the operation and maintenance script is generated according to the preset script generation rules.

It should be understood that the automatic script writing and publishing sub-module 301 can determine the fault handling method from the acquired fault handling template, and then determine the corresponding fault handling code structure from that method. The fault handling template determined from the preset operation and maintenance knowledge base contains the handling method for the fault. Operation and maintenance scripts in the system contain a large amount of reused code, the content and structure of some protocol code and general-purpose method code are fixed, and the code structure corresponding to a given handling method is usually the same. The sub-module 301 can therefore perform statistical analysis on historical operation and maintenance scripts to determine the code structure corresponding to each fault handling method, and build an operation and maintenance script code library from the result. The code library includes a number of code snippet modules that have been modularly encapsulated in advance.

After determining the handling methods contained in the fault handling template, the automatic script writing and publishing sub-module 301 can look up the code snippet corresponding to each handling method in the operation and maintenance script code library, and then combine these snippets to produce the fault handling code structure.

The automatic script writing and publishing sub-module 301 can determine the code parameters for each part of the fault handling code structure. The parameters can be derived from the current system status and the fault type, or entered manually by operation and maintenance personnel. To derive them from the current system status and fault type, an operation and maintenance script publishing model is trained in advance, using historical operation and maintenance parameters and historical system status data as training samples. The current system status and fault type are then input into the model, and the code parameters are determined from its output. The historical operation and maintenance data and historical system status data may include CPU, memory and network card monitoring data, the server's queries per second, request latency, and website login information; other types of operation and maintenance data can be added according to business needs.

A training sample set is constructed from this historical data. By extracting the feature values of each training sample, a random forest model and a deep neural network model based on operation and maintenance system data are constructed. The deep neural network model is trained with the data in the training sample set to obtain the operation and maintenance script publishing model. Inputting the current system status and fault type into this model yields the corresponding code parameters.

In addition to generating code parameters automatically as described above, operation and maintenance personnel may log in, through a bastion host browser, to the back-end script writing page of the unified script publishing module and manually enter the code parameters for each part of the fault handling code structure.

After obtaining the code parameters, the automatic script writing and publishing sub-module 301 fills them into the selected fault handling code structure and generates the operation and maintenance script according to the preset script generation rules. The preset script generation rules comprise the syntax rules of the scripting language and the combination rules of the modules. The syntax rules ensure that the script contains no syntax errors; they cover, for example, variable definitions, function definitions and format requirements, and the specific rules depend on the language in which the script is written. The combination rules ensure that the script contains no logical errors; they cover, for example, the order in which modules are combined and how they are nested.
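The following sketch illustrates one way the fill-and-generate step could work. It is an assumption-laden illustration rather than the patented implementation: the snippet library contents, the ordering list `MODULE_ORDER` and the function `generate_script` are all hypothetical, the generated scripts are assumed to be Python, and the "syntax rule" is approximated by parsing the result with the standard `ast` module.

```python
import ast
from string import Template

# Hypothetical snippet library: handling method -> parameterised code fragment.
SNIPPET_LIBRARY = {
    "check_memory": Template(
        "threshold = $threshold\nprint('memory limit', threshold)\n"
    ),
    "restart_service": Template(
        "service = '$service'\nprint('restarting', service)\n"
    ),
}

# Hypothetical combination rule: the order in which fragments must appear.
MODULE_ORDER = ["check_memory", "restart_service"]


def generate_script(methods, params):
    """Combine the fragments for `methods`, fill in `params`, and syntax-check."""
    ordered = sorted(methods, key=MODULE_ORDER.index)  # apply the combination rule
    body = "".join(SNIPPET_LIBRARY[m].substitute(params) for m in ordered)
    ast.parse(body)  # syntax rule: raises SyntaxError if the script is malformed
    return body


script = generate_script(
    ["restart_service", "check_memory"],
    {"threshold": 90, "service": "nginx"},
)
```

`Template.substitute` raises `KeyError` when a required parameter is missing, so an incompletely parameterised script is rejected before it reaches review.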

Figure 3 is a schematic flow diagram of how the automatic script writing and publishing sub-module 301 automatically configures an operation and maintenance script. Code parameters are either entered manually by operation and maintenance personnel or generated automatically by the model. The sub-module 301 fills the code parameters into the fault handling code structure to generate the operation and maintenance script, and then reviews the script. The review includes a security scan, a lookup against the blacklist and whitelist policies, and a business audit policy check. After the script passes review, it is published and loaded.
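A three-stage review of this kind might be sketched as below. The specific patterns, the whitelisted commands and the length limit are invented for illustration only; the source does not specify what each review stage actually checks.

```python
import re

# Hypothetical policies; a real deployment would load these from configuration.
BLACKLIST = [r"rm\s+-rf\s+/", r"\bmkfs\b", r"\bshutdown\b"]
WHITELIST_COMMANDS = {"systemctl", "echo", "free"}
AUDIT_MAX_LINES = 50


def security_scan(script):
    """Pass only if the script matches no blacklisted pattern."""
    return not any(re.search(pattern, script) for pattern in BLACKLIST)


def whitelist_check(script):
    """Pass only if every non-empty line starts with a whitelisted command."""
    lines = [line.split() for line in script.splitlines() if line.strip()]
    return all(words[0] in WHITELIST_COMMANDS for words in lines)


def audit_check(script):
    """Toy business-audit rule: limit the script length."""
    return len(script.splitlines()) <= AUDIT_MAX_LINES


def review(script):
    """Return True only if every review stage passes."""
    checks = (security_scan, whitelist_check, audit_check)
    return all(check(script) for check in checks)
```

For example, `review("systemctl restart nginx\necho done")` passes all three stages, while a script containing `rm -rf /` fails the security scan.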

In this embodiment, an automatic script writing and publishing sub-module 301 is provided in the script publishing module 30. The sub-module 301 is configured to query the preset operation and maintenance script library, according to the fault handling template, for an operation and maintenance script matching the fault type, and to publish and load that script when one is found. The sub-module 301 is further configured to, when no matching script is found, automatically configure and generate an operation and maintenance script according to the fault handling template, and to review, publish and load it. Operation and maintenance scripts are thus configured automatically, which improves the flexibility and applicability of script configuration.

Refer to Figure 4, which is a schematic structural diagram of the third embodiment of the automated operation and maintenance platform of the present invention.

Based on the above first embodiment, the script publishing module 30 in this embodiment further includes a script maintenance sub-module 302 and a script update management sub-module 303.

The script maintenance sub-module 302 is configured to monitor, at a preset period, the current status information of the system after the current operation and maintenance script has been executed, and to detect whether the current status information is consistent with the preset system status.

It should be noted that the script loaded after the automatic script writing and publishing sub-module 301 publishes it was generated automatically from the preset operation and maintenance knowledge base and the operation and maintenance script publishing model. Because of limitations in model training, the generated script may not suit the current system fault, in which case the system cannot be restored to the expected state after the script is executed. A script maintenance sub-module 302 is therefore also provided in the script publishing module 30 to monitor the operation and maintenance script and ensure that it is working correctly.

The preset period can be set in hours or days. For example, the current status information of the system after execution of the current operation and maintenance script may be collected every 3 hours and compared with the preset system status, to determine whether the script is effective. The preset system status is the ideal or expected state of the system after the current operation and maintenance script has been executed.

In this embodiment, the preset system status can be obtained from an operation and maintenance script running status monitoring model. The script maintenance sub-module 302 is further configured to obtain this model and to derive fault information from the current running status. The sub-module 302 is further configured to input the fault information and the current operation and maintenance script into the model, to obtain the preset system status expected after the current script is executed.

It should be noted that once the operation and maintenance script has been determined, the script and the fault information can be input into the running status monitoring model to predict the expected system state after the script is executed.

Further, the running status monitoring model can be built in advance. The script maintenance sub-module 302 is therefore further configured to obtain historical operation and maintenance data and preprocess it to obtain the corresponding training data, and to train on that data to obtain the operation and maintenance script running status monitoring model.

It should be noted that the historical operation and maintenance data includes the fault type, the system status information at the time of the fault, the operation and maintenance script, and the system status information after the script was executed. Preprocessing this data yields the corresponding training data. Preprocessing mainly includes: data cleaning, which fills missing values and detects and removes noise values and outliers; dimensionality reduction, which uses a supervised approach to reduce the dimensionality of the target data; text cleaning, which deletes redundant features or uses clustering to eliminate redundant data; discretization, which reduces storage space and converts continuous values into categorical features for particular machine learning methods; and normalization, which normalizes the extracted feature vectors. The preprocessed operation and maintenance feature data set is divided into two parts: 70-90% of it serves as the training set used to train the model, and the remaining 10-30% serves as the test set used to evaluate the model. The training set is then used to train the operation and maintenance script running status monitoring model.
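Three of the preprocessing steps listed above (filling missing values, normalization, and the train/test split) can be sketched in plain Python. This is illustrative only: mean imputation and min-max scaling are assumed choices, since the source does not say which filling or normalization method is used, and the sample rows are invented.

```python
import random


def fill_missing(rows):
    """Replace None in each column with the mean of that column's known values.

    Assumes every column has at least one known value.
    """
    means = []
    for column in zip(*rows):
        known = [v for v in column if v is not None]
        means.append(sum(known) / len(known))
    return [[means[i] if v is None else v for i, v in enumerate(row)] for row in rows]


def min_max_normalise(rows):
    """Scale every column to the range [0, 1]; constant columns become 0.0."""
    columns = list(zip(*rows))
    lo = [min(c) for c in columns]
    hi = [max(c) for c in columns]
    return [
        [(v - lo[i]) / (hi[i] - lo[i]) if hi[i] > lo[i] else 0.0
         for i, v in enumerate(row)]
        for row in rows
    ]


def train_test_split(rows, train_fraction=0.8, seed=0):
    """Shuffle and split; the text allows a training share of 70-90%."""
    if not 0.7 <= train_fraction <= 0.9:
        raise ValueError("train_fraction should lie between 0.7 and 0.9")
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Invented sample: [memory usage %, requests per second], with gaps.
raw = [[90.0, 10.0], [None, 30.0], [50.0, 20.0], [70.0, None], [60.0, 25.0],
       [80.0, 15.0], [40.0, 35.0], [55.0, 12.0], [65.0, 18.0], [75.0, 22.0]]
clean = min_max_normalise(fill_missing(raw))
train, test = train_test_split(clean, train_fraction=0.8)
```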

Specifically, the CART decision tree method can be used, with the Gini index as the measure for splitting on features. During prediction, an attribute value is tested at each internal node of the tree, and the test result determines which branch to follow, until a leaf node is reached and the classification result is obtained. The CART pruning algorithm removes some subtrees from the bottom of the fully grown decision tree, which simplifies the model and allows it to predict unseen data more accurately.
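To make the split criterion concrete, the sketch below computes the Gini index of a label set and searches for the single threshold on one numeric feature that minimises the weighted Gini index of the two child nodes, which is the basic step CART repeats at each node. The feature values and labels are invented, and the function names are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini index of a label list: 1 minus the sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best split 'value <= threshold'.

    Returns (None, gini(labels)) if no split improves on the unsplit node.
    """
    best = (None, gini(labels))
    for threshold in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best


# Invented data: memory usage (%) and whether the system was faulty.
memory = [40, 50, 60, 80, 90, 95]
status = ["ok", "ok", "ok", "fault", "fault", "fault"]
threshold, score = best_split(memory, status)  # splits cleanly at 60
```

A full CART implementation would apply this search over every feature, recurse on the two child nodes, and then prune the grown tree.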

Given the training data set and a stopping condition, the algorithm starts from the root node and recursively processes each node to produce a binary decision tree. The CART pruning algorithm then takes this binary decision tree as input and outputs the optimal decision tree. This completes the training of the operation and maintenance script running status monitoring model.

The trained running status monitoring model can predict the normal status information of the system in each time period after the operation and maintenance script is executed, assuming the script runs normally.

For example, suppose the detected system fault is "insufficient system memory (abnormal usage of 90%), causing memory reads and writes to fail", and the unified script publishing platform has generated operation and maintenance script a for this fault. After script a is executed, the running status monitoring model can predict the system state the user expects in each time period. Suppose the predictions are: "predicted state a: one hour after the script starts, system memory usage is 80%; predicted state b: two hours after the script starts, system memory usage is 60%; predicted state c: three hours after the script starts, system memory usage is 50%; predicted state d: four hours after the script starts, system memory usage is 40%, memory reads and writes are normal, and the fault is eliminated". After the script is published and executed, the unified script publishing platform can collect the current system status data periodically, following the time intervals of the predicted states, and compare it with the predicted state to determine whether the script is effective. If it is, the script continues to run and periodic monitoring continues until the fault is eliminated and the operation and maintenance task ends. The platform can then publish the script, together with the corresponding fault type, to the preset operation and maintenance script library, so that when a similar fault occurs later the corresponding script can be selected directly from the library.
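The periodic comparison in this example can be sketched as follows. The predicted values are the ones given in the example; the tolerance of 5 percentage points and the function names are assumptions, since the source does not define how close the observed state must be to count as "consistent".

```python
# Predicted memory usage (%) at each hourly check, taken from the example.
PREDICTED = {1: 80, 2: 60, 3: 50, 4: 40}
TOLERANCE = 5  # hypothetical allowed deviation, in percentage points


def is_consistent(hour, observed_usage):
    """True if the observed usage is within TOLERANCE of the prediction."""
    return abs(observed_usage - PREDICTED[hour]) <= TOLERANCE


def monitor(observations):
    """Check each period in order.

    Returns ("done", None) if every check passes, or ("update", hour) for the
    first period whose observation is inconsistent with the prediction.
    """
    for hour in sorted(PREDICTED):
        if not is_consistent(hour, observations[hour]):
            return "update", hour
    return "done", None
```

With observations `{1: 78, 2: 62, 3: 50, 4: 41}` the script is judged effective throughout, whereas `{1: 80, 2: 75, 3: 50, 4: 40}` triggers an update at the second check.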

In this embodiment, the script maintenance sub-module 302 is further configured to, when the current status information is inconsistent with the preset system status, generate update information and send it to the script update management sub-module 303. The script update management sub-module 303 is configured to, in response to the update information, obtain a state difference from the current status information and the preset system status, and to update the current operation and maintenance script according to the state difference.

When the current status information is inconsistent with the preset system status, the current operation and maintenance script needs to be updated. The script maintenance sub-module 302 therefore generates update information and sends it to the script update management sub-module 303, which updates the current script.

After receiving the update information, the script update management sub-module 303 obtains the current system state value and the predicted system state value from the current status information and the preset system status, and calculates the difference between them to obtain the state difference. It updates the current operation and maintenance script according to the state difference, terminates the previous script, and republishes and executes the updated script. The script maintenance sub-module 302 then repeats the process of monitoring, at the preset period, the current status information of the system after the script is executed and detecting whether it is consistent with the preset system status.
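A minimal sketch of the state-difference calculation is shown below. It assumes, purely for illustration, that system states are dictionaries of numeric metrics and that an update is warranted when any metric deviates by more than a tolerance; the metric names and the threshold are hypothetical.

```python
def state_difference(current, predicted):
    """Per-metric difference (current - predicted) for metrics present in both."""
    return {
        name: current[name] - predicted[name]
        for name in predicted
        if name in current
    }


def needs_update(difference, tolerance):
    """True if any metric deviates from the prediction by more than `tolerance`."""
    return any(abs(delta) > tolerance for delta in difference.values())


diff = state_difference(
    {"memory_pct": 75, "cpu_pct": 30},
    {"memory_pct": 60, "cpu_pct": 28},
)
```

Here `diff` records that memory usage is 15 points above the prediction and CPU usage 2 points above it, so `needs_update(diff, 5)` would signal that the script should be updated.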

Specifically, the script update management sub-module 303 can use the pre-trained operation and maintenance script status monitoring model to determine, from the difference between the current and predicted system state values, the corresponding optimization parameters (covering both the structure of the operation and maintenance code and the specific code parameters). The unified script publishing module can then use these optimization parameters to update the code of the original script, obtain the iterated script, and publish it so that the original script in the script library is updated.

In this embodiment, a script maintenance sub-module 302 and a script update management sub-module 303 are also provided in the script publishing module 30. The script maintenance sub-module 302 is configured to monitor, at a preset period, the current status information of the system after the current operation and maintenance script has been executed, and to detect whether it is consistent with the preset system status. It is further configured to, when the two are inconsistent, generate update information and send it to the script update management sub-module 303. The script update management sub-module 303 is configured to, in response to the update information, obtain a state difference from the current status information and the preset system status, and to update the current script according to the state difference. The script maintenance sub-module 302 automatically evaluates the effect of the currently running script on the basis of the system status and historical operation and maintenance data. When it determines that the script cannot resolve the fault as the system expects, the script update management sub-module 303 automatically updates the script and republishes it. Operation and maintenance scripts are thus upgraded and iterated automatically according to actual needs, achieving operation and maintenance automation.

Based on the above first embodiment, a fourth embodiment of the automated operation and maintenance platform of the present invention is proposed.

Based on the above first embodiment, the operation and maintenance module 20 in this embodiment is further configured to obtain the system's historical operation and maintenance data and preprocess it to obtain an operation and maintenance data text set. The module 20 is further configured to perform association rule mining on the text set, converting the historical operation and maintenance data in it into operation and maintenance knowledge data. The module 20 is further configured to build the preset operation and maintenance knowledge base from the knowledge data.

It should be noted that the operation and maintenance module 20 can build the preset operation and maintenance knowledge base from the system's historical operation and maintenance data, which includes fault text descriptions, operation and maintenance handling plans, and the results corresponding to those plans. Figure 5 is a schematic diagram of how the preset knowledge base is built. Historical operation and maintenance data is collected, including IP, asset chemistry (资产化学 in the source), and internal and external data. The collected data is preprocessed, for example by conversion to a unified format and data cleaning, to obtain an initial operation and maintenance data text set. Stop words are then removed from the data in this set to obtain a reference operation and maintenance data text set. Next, the text in the reference set is tagged. The BIO (Begin, Intermediate, Other) tagging method from natural language processing can be used to perform sequence labeling on the operation and maintenance statements in the reference set, marking each corpus entity with its category information. For example, "进程检测故障" (process detection fault) can be tagged as 进(B-NA)程(B-NA)检(I-VB)测(I-VB)故(I-NB)障(I-NB). This yields the operation and maintenance data text set. Before association mining is performed on the text set, each tagged entity is represented as a vector that encodes information such as its position in the original data set and its tag. The tagged text set can be input into a word2vec model, and the Skip-Gram and CBOW (Continuous Bag-of-Words) models of the word2vec framework convert the input into vector form, turning the operation and maintenance data text into a set of operation and maintenance word vectors.

Association rule mining can then be performed on the operation and maintenance data text set to convert the historical operation and maintenance data into operation and maintenance knowledge data. Specifically, the word vector set obtained from the text is input into an association mining algorithm (for example a generateRules function or the Apriori algorithm) to obtain the frequent itemsets and confidence. The word vector set is also input into a Bi-LSTM model to obtain a word vector matrix carrying contextual information. This matrix is input into a TransE model, and a joint calculation using the Euclidean formula yields a primary relationship threshold. The calculation of the primary relationship threshold is repeated, and the critical value of the relationship threshold is determined from the multiple results. The frequent itemsets, the confidence and the critical value are then passed together into the association rule mining model.

The mining relies on the following theorem: "for each non-empty subset x of a frequent itemset l, compute confidence(x→(l-x)); if confidence(x→(l-x)) ≥ confmin, then the rule x→(l-x) holds". The association rule mining algorithm therefore scans the frequent itemsets, enumerates each subset and calculates its confidence. When the confidence satisfies the condition (that is, it is greater than or equal to the minimum confidence), a rule is generated. The result is the set of association relationships among the operation and maintenance entities in the text set, from which the operation and maintenance knowledge data is determined. Table 1 below shows the obtained operation and maintenance knowledge data.
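The rule-generation step stated in the theorem can be sketched directly. The confidence of a rule x→(l-x) is computed here with the standard definition support(l) / support(x). The transactions are invented for illustration, and `generate_rules` is a hypothetical helper, not the generateRules function mentioned in the text.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def generate_rules(frequent_itemset, transactions, min_conf):
    """Return (antecedent, consequent, confidence) for each rule x -> (l - x)
    whose confidence is at least `min_conf`.

    Assumes `frequent_itemset` occurs in at least one transaction.
    """
    l = frozenset(frequent_itemset)
    rules = []
    for size in range(1, len(l)):  # every non-empty proper subset x of l
        for x in map(frozenset, combinations(sorted(l), size)):
            conf = support(l, transactions) / support(x, transactions)
            if conf >= min_conf:
                rules.append((x, l - x, conf))
    return rules


# Invented history: each record pairs a fault with the plan that was applied.
history = [
    {"process interruption", "plan 1"},
    {"process interruption", "plan 1"},
    {"process interruption", "plan 1"},
    {"blocking", "plan 2"},
    {"blocking", "plan 1"},
]
rules = generate_rules({"process interruption", "plan 1"}, history, min_conf=0.8)
```

With these records, only the rule "process interruption → plan 1" meets the 0.8 threshold; the reverse rule has confidence 0.75 because plan 1 was also applied to a blocking fault.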

Table 1

No. | Fault information | Type | Handling plan | Result
1 | Process interruption | A | Plan 1 | Recovered
2 | Blocking | B | Plan 2 | Recovered
3 | Load | C | Plan 3 | Recovered

It should be understood that once the operation and maintenance knowledge data has been obtained, the preset operation and maintenance knowledge base can be built.

In this embodiment, the operation and maintenance module 20 is further configured to build a search engine in the preset operation and maintenance knowledge base and to store the operation and maintenance knowledge data in the knowledge base. The module 20 is further configured to obtain alarm information from the running status and to determine, through semantic analysis, the fault type corresponding to the alarm information. The module 20 is further configured to match, in the preset knowledge base, the fault handling template corresponding to the fault type, and to send the fault handling template to the script publishing module 30.

It can be understood that the operation and maintenance module 20 also builds a search engine within the preset operation and maintenance knowledge base, and stores the obtained operation and maintenance knowledge data in the knowledge base using production-rule representation.

After receiving the operating status, the operation and maintenance module 20 obtains alarm information from it, determines the fault type corresponding to the alarm information through semantic analysis, retrieves the corresponding fault processing template from the preset operation and maintenance knowledge base, and sends the fault processing template to the script publishing module 30.
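As a rough sketch of how production rules and template matching might fit together, the example below stores the rows of Table 1 as IF/THEN records and looks them up by fault type. The class and function names are hypothetical, and the keyword match is only a stand-in for the semantic analysis the text mentions, which the source does not specify.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ProductionRule:
    fault_info: str  # condition part: IF the alarm describes this fault
    fault_type: str  # fault category assigned to the alarm
    solution: str    # action part: THEN apply this solution

# Rows of Table 1 expressed as production rules.
KNOWLEDGE_BASE = [
    ProductionRule("process interruption", "A", "solution 1"),
    ProductionRule("blocking", "B", "solution 2"),
    ProductionRule("load", "C", "solution 3"),
]

def classify_alarm(alarm_text):
    """Keyword stand-in for semantic analysis: return the fault type, or None."""
    text = alarm_text.lower()
    for rule in KNOWLEDGE_BASE:
        if rule.fault_info in text:
            return rule.fault_type
    return None

def match_template(fault_type):
    """Return the rule (fault processing template) for a fault type, or None."""
    for rule in KNOWLEDGE_BASE:
        if rule.fault_type == fault_type:
            return rule
    return None

fault_type = classify_alarm("Worker 3: process interruption detected")
template = match_template(fault_type)
```

A real deployment would replace the linear scans with the search engine built over the knowledge base; the list is used here only to keep the sketch self-contained.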

In this embodiment, the operation and maintenance module 20 is further configured to acquire historical operation and maintenance data of the system and preprocess it to obtain an operation and maintenance data text set; to perform association rule mining on that text set, converting the historical operation and maintenance data it contains into operation and maintenance knowledge data; and to establish the preset operation and maintenance knowledge base from the operation and maintenance knowledge data. Because the knowledge base is established in advance, the fault processing template corresponding to a fault type can be matched quickly, which improves processing efficiency.

Based on the first embodiment above, a fifth embodiment of the automated operation and maintenance platform of the present invention is proposed.

Based on the first embodiment above, the operation monitoring module 10 in this embodiment is further configured to collect historical fault data and historical operating status data of the system.

In a specific implementation, after obtaining the operation data, the operation monitoring module 10 analyzes it to determine the current operating status of the system. The current operating status can be determined using an alarm prediction model, which can be established in advance. The operation monitoring module 10 therefore also collects historical fault data and historical operating status data for building the alarm prediction model. The historical fault data may include, for example, the fault types or alarm names that have occurred during the system's past operation.

The operation monitoring module 10 is further configured to train on the historical fault data and the historical operating status data to establish the alarm prediction model.

Because a wide variety of applications are integrated on the cloud service system, the system receives a large volume of fault data of many kinds. To help operation and maintenance personnel and testers analyze this fault data, an alarm prediction model can be built from the historical fault data. Once the historical fault data has been collected, it must be classified and managed: after filtering, normalization, statistical analysis, and labeling, a fault alarm knowledge base containing fault (or alarm) types and fault alarm feature quantities is obtained. The collected historical fault data can then be preprocessed according to this fault alarm knowledge base to obtain a training sample set.

In this solution, the system's previously collected historical fault (alarm) data and historical operating status data are used as training samples, and a long short-term memory (LSTM) network is trained on them in advance to establish the alarm prediction model.

The alarm prediction model can be trained as follows. First, an alarm prediction model is constructed based on an LSTM neural network. Next, the historical operating parameters and the corresponding fault types in the training sample set are input into the LSTM neural network for training, and the network outputs predicted fault types. A loss value is obtained by evaluating the predicted fault types against the actual fault types, and the loss value is used to judge whether training has been effective. When the loss value falls below a preset threshold, training is complete and a fault prediction model that meets the requirements is obtained.
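The stopping criterion in this procedure (train until the loss falls below a preset threshold) can be illustrated independently of the network architecture. The sketch below deliberately swaps the LSTM for a two-parameter logistic model so that it runs with the standard library alone; it is not the patent's model. The function names, learning rate, sample data, and threshold of 0.1 are all assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_until_threshold(samples, loss_threshold, lr=0.5, max_epochs=1000):
    """Gradient descent on a logistic model; stop once the mean loss < loss_threshold.

    Returns (w, b, loss, epoch), where loss was measured before the final update.
    """
    w, b = 0.0, 0.0
    loss = float("inf")
    n = len(samples)
    for epoch in range(1, max_epochs + 1):
        grad_w = grad_b = loss = 0.0
        for x, y in samples:
            p = sigmoid(w * x + b)  # predicted probability of a fault
            loss -= y * math.log(p) + (1 - y) * math.log(1 - p)  # cross-entropy
            grad_w += (p - y) * x
            grad_b += p - y
        loss /= n
        if loss < loss_threshold:  # training effect judged acceptable
            return w, b, loss, epoch
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b, loss, max_epochs

# Hypothetical samples: (operating parameter, 1 if a fault occurred else 0).
samples = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
w, b, final_loss, epochs = train_until_threshold(samples, loss_threshold=0.1)
```

With an LSTM, the inner loop would be replaced by a forward and backward pass over sequences of operating parameters, while the threshold check on the loss would keep the same shape.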

The operation monitoring module 10 is further configured to input the operation data into the alarm prediction model to obtain a predicted value of the system's current operating status. The operation monitoring module 10 is further configured to compare this predicted value with a preset alarm threshold and, when the predicted value is greater than the preset alarm threshold, to determine that the system requires automated operation and maintenance.

It should be noted that once the alarm prediction model has been established, it can be used to predict the current operating status of the system from the operation data collected in the current monitoring cycle: the operation data is input into the alarm prediction model, which outputs the predicted value of the current operating status.

The preset alarm threshold can be set according to actual needs and reflects the critical state of system operation. When the predicted value of the current operating status is greater than the preset alarm threshold, the system is abnormal and requires alarm processing, that is, automated operation and maintenance. When the prediction result indicates such an abnormality, the operation monitoring module 10 sends alarm information to the operation and maintenance module 20 together with the fault type corresponding to that alarm information.

In a specific implementation, the operation monitoring module 10 is further configured to determine that the system does not require automated operation and maintenance when the predicted value of the current operating status is less than or equal to the preset alarm threshold.

When the predicted value of the current operating status is less than or equal to the preset alarm threshold, the system is not abnormal, automated operation and maintenance is not required, and the operation monitoring module 10 continues to monitor the system.
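The comparison described above is strict: a predicted value exactly equal to the threshold does not trigger automated operation and maintenance. A minimal sketch of that decision, with a hypothetical function name and illustrative values:

```python
def needs_automated_om(predicted_state, alarm_threshold):
    """True only when the predicted value is strictly greater than the threshold."""
    return predicted_state > alarm_threshold

# Illustrative values; the threshold would be set according to actual needs.
ALARM_THRESHOLD = 0.7
decisions = [needs_automated_om(value, ALARM_THRESHOLD) for value in (0.5, 0.7, 0.9)]
```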

In this embodiment, the operation monitoring module 10 is further configured to collect historical fault data and historical operating status data of the system; to train on the historical fault data and the historical operating status data to establish an alarm prediction model; to input the operation data into the alarm prediction model to obtain a predicted value of the system's current operating status; to compare the predicted value with a preset alarm threshold and determine that the system requires automated operation and maintenance when the predicted value is greater than the preset alarm threshold; and to determine that the system does not require automated operation and maintenance when the predicted value is less than or equal to the preset alarm threshold. By inputting the operation data into the established alarm prediction model, the predicted value of the current operating status is obtained quickly, which in turn determines whether the system requires automated operation and maintenance and thereby realizes operation and maintenance automation.

An embodiment of the present invention provides an automated operation and maintenance method. Referring to FIG. 6, FIG. 6 is a schematic flowchart of the first embodiment of the automated operation and maintenance method of the present invention. The method is applied to the automated operation and maintenance platform described above.

In this embodiment, the automated operation and maintenance method includes the following steps:

Step S10: The operation monitoring module collects operation data of the system, analyzes the operation data to obtain the current operating status of the system, and determines according to the current operating status whether the system requires automated operation and maintenance.

The operation monitoring module mainly collects operation data of the platform or system. The operation data includes operating parameters such as the operating status, business data, and service data of the platform or system. The module analyzes the collected operation data to obtain the current operating status of the system. The operating status data of the system may include anomaly monitoring data for each application module in the system and usage data for each application module in the system.

Once the operation monitoring module has obtained the current operating status of the system, it determines from that status whether the system requires automated operation and maintenance. It does so by performing risk prediction and business alarming on the current operating status, for example by checking for emergencies or business failures.

Step S20: When the system requires automated operation and maintenance, the operation and maintenance module matches a corresponding fault processing template from the preset operation and maintenance knowledge base according to the current operating status, and sends the fault processing template to the script publishing module.

It should be understood that when the system requires automated operation and maintenance, its current operating status is abnormal. The operation monitoring module outputs data such as the system's alarm information and sends the alarm information to the operation and maintenance module. The operation and maintenance module mainly matches the corresponding fault processing template in the preset operation and maintenance knowledge base according to the alarm information output by the operation monitoring module, and sends the fault processing template to the script publishing module.

Further, after the alarm information in the current operating status output by the operation monitoring module is received, the fault type corresponding to the alarm information can be determined through semantic analysis. The corresponding fault processing template is then retrieved from the preset operation and maintenance knowledge base and sent to the script publishing module.

Step S30: The script publishing module configures a corresponding operation and maintenance script according to the fault processing template, and publishes and loads the operation and maintenance script.

It should be noted that the script publishing module configures the corresponding operation and maintenance script according to the fault processing template and then publishes and loads it.

It should be understood that before configuring an operation and maintenance script, the script publishing module first searches the existing preset operation and maintenance script library for a script that matches the current fault type. When a matching operation and maintenance script is found, that script is executed to handle the fault.

FIG. 7 is a schematic diagram of the overall flow of the automated operation and maintenance method, which proceeds as follows.

Step 1: The operation monitoring module monitors and collects operation-related data such as the system's operating status, business data, and service data.

Step 2: The operation monitoring module analyzes the operation-related data to determine the current operating status of the system, and judges from that status whether automated operation and maintenance is required.

Step 3: The operation and maintenance module performs fault matching in the pre-built operation and maintenance knowledge base according to the fault type carried in the alarm prompt, and determines the corresponding fault processing template.

Step 4: The script publishing module configures an operation and maintenance script according to the fault processing template obtained in step 3, and publishes and loads the script.

Step 5: At a preset period, the script publishing module monitors the status information of the system after the operation and maintenance script has been executed, and judges whether the system status after execution is consistent with the expected system status. When the two are consistent, it continues monitoring. When they are inconsistent, the flow proceeds to step 6.

Step 6: The difference between the current system status value and the predicted system status value is determined, and the current operation and maintenance script is updated based on that difference.
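Steps 5 and 6 can be sketched as a single check that runs once per monitoring period. The source does not say how system status is represented or how a script is updated from the difference, so the example assumes the status is a dictionary of numeric metrics and passes the difference to a caller-supplied callback; the function names and the tolerance parameter are hypothetical.

```python
def state_difference(current, expected, tolerance=0.0):
    """Return {metric: current - expected} for metrics deviating by more than tolerance."""
    return {
        name: current[name] - expected[name]
        for name in expected
        if abs(current[name] - expected[name]) > tolerance
    }

def check_after_execution(current, expected, update_script, tolerance=0.0):
    """Step 5: compare states. Step 6: on mismatch, hand the difference to the updater."""
    diff = state_difference(current, expected, tolerance)
    if diff:
        update_script(diff)
    return diff

# Hypothetical metrics observed after the script ran, and the state that was expected.
expected_state = {"cpu": 0.5, "mem": 0.6}
current_state = {"cpu": 0.9, "mem": 0.6}
update_requests = []
diff = check_after_execution(
    current_state, expected_state, update_requests.append, tolerance=0.05
)
```

An empty result means the states are consistent and monitoring simply continues at the next period.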

In this embodiment, the operation monitoring module collects operation data of the system, analyzes it to obtain the current operating status of the system, and determines according to that status whether the system requires automated operation and maintenance. When the system requires automated operation and maintenance, the operation and maintenance module matches a corresponding fault processing template from the preset operation and maintenance knowledge base according to the current operating status and sends the template to the script publishing module. The script publishing module configures the corresponding operation and maintenance script according to the fault processing template and publishes and loads it. As a result, operation and maintenance scripts are upgraded and iterated automatically according to actual needs, operation and maintenance is automated, efficiency improves, and the script library no longer needs to be updated manually, which greatly reduces operating costs.

In some embodiments, the script publishing module includes an automatic script writing and publishing sub-module. The step in which the script publishing module configures the corresponding operation and maintenance script according to the fault processing template and publishes and loads the script specifically includes: the automatic script writing and publishing sub-module queries the preset operation and maintenance script library, according to the fault processing template, for an operation and maintenance script that matches the fault type, and publishes and loads that script when one is found; when no operation and maintenance script matching the fault type is found, the automatic script writing and publishing sub-module automatically configures and generates an operation and maintenance script according to the fault processing template, and reviews, publishes, and loads the script.

In some embodiments, the step of automatically configuring and generating an operation and maintenance script according to the fault processing template, and reviewing, publishing, and loading the script, specifically includes: the automatic script writing and publishing sub-module determines a fault handling method according to the fault processing template, and determines a corresponding fault handling code structure according to the fault handling method; the automatic script writing and publishing sub-module determines the corresponding code parameters according to the fault handling code structure, fills the code parameters into the fault handling code structure, and generates the operation and maintenance script according to preset script generation rules.
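One way to picture the lookup-then-generate flow is with parameterized code structures that are filled in only when the script library has no match. The sketch below uses Python's `string.Template` for the filling step. The method names, the shell snippets, the dictionary layout of the fault processing template, and the `needs_review` flag are illustrative assumptions, not details from the source.

```python
from string import Template

# Hypothetical code structures, keyed by fault handling method.
CODE_STRUCTURES = {
    "restart_process": Template("#!/bin/sh\nsystemctl restart $service\n"),
    "purge_old_logs": Template("#!/bin/sh\nfind $path -name '*.log' -mtime +$days -delete\n"),
}

def get_or_generate_script(fault_type, template, script_library):
    """Return (script, needs_review); reuse a library script when one matches."""
    if fault_type in script_library:
        return script_library[fault_type], False
    structure = CODE_STRUCTURES[template["method"]]    # method -> code structure
    script = structure.substitute(template["params"])  # KeyError if a parameter is missing
    return script, True                                # newly generated: review before publishing

script_library = {}
template = {"method": "restart_process", "params": {"service": "nginx"}}
script, needs_review = get_or_generate_script("A", template, script_library)

# After review, the script joins the library and is reused for the same fault type.
script_library["A"] = script
reused_script, reused_needs_review = get_or_generate_script("A", template, script_library)
```

Keeping generated scripts out of the library until they pass review mirrors the order given in the text, where generation is followed by review and only then by publishing and loading.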

In some embodiments, the script publishing module further includes a script maintenance sub-module and a script update management sub-module. After the script publishing module configures the corresponding operation and maintenance script according to the fault processing template and publishes and loads the script, the method further includes: the script maintenance sub-module monitors, at a preset period, the current status information of the system after the current operation and maintenance script has been executed, and checks whether the current status information is consistent with the preset system status; when the current status information is inconsistent with the preset system status, the script maintenance sub-module generates update information and sends it to the script update management sub-module; based on the update information, the script update management sub-module obtains a status difference from the current status information and the preset system status, and updates the current operation and maintenance script according to the status difference.

In some embodiments, before the script maintenance sub-module monitors, at a preset period, the current status information of the system after the current operation and maintenance script has been executed and checks whether it is consistent with the preset system status, the method further includes: the script maintenance sub-module obtains an operation and maintenance script running status monitoring model, and obtains fault information according to the current operating status; the script maintenance sub-module inputs the fault information and the current operation and maintenance script into the operation and maintenance script running status monitoring model to obtain the preset system status expected after the current operation and maintenance script is executed.

In some embodiments, the step in which the script maintenance sub-module obtains the operation and maintenance script running status monitoring model specifically includes: the script maintenance sub-module acquires historical operation and maintenance data and preprocesses it to obtain the corresponding training data; the script maintenance sub-module trains on the training data to obtain the operation and maintenance script running status monitoring model.

In some embodiments, before the operation and maintenance module, when the system requires automated operation and maintenance, matches the corresponding fault processing template from the preset operation and maintenance knowledge base according to the current operating status and sends the template to the script publishing module, the method further includes: the operation and maintenance module acquires historical operation and maintenance data of the system and preprocesses it to obtain an operation and maintenance data text set; the operation and maintenance module performs association rule mining on the operation and maintenance data text set, converting the historical operation and maintenance data it contains into operation and maintenance knowledge data; the operation and maintenance module establishes the preset operation and maintenance knowledge base according to the operation and maintenance knowledge data.

In some embodiments, after the operation and maintenance module establishes the preset operation and maintenance knowledge base according to the operation and maintenance knowledge data, the method further includes: the operation and maintenance module builds a search engine in the preset operation and maintenance knowledge base and stores the operation and maintenance knowledge data in that knowledge base. The step of matching the corresponding fault processing template from the preset operation and maintenance knowledge base according to the current operating status and sending the template to the script publishing module specifically includes: the operation and maintenance module obtains alarm information according to the operating status and determines, through semantic analysis, the fault type corresponding to the alarm information; the operation and maintenance module matches, in the preset operation and maintenance knowledge base, the fault processing template corresponding to the fault type, and sends the fault processing template to the script publishing module.

In some embodiments, the step in which the operation monitoring module collects operation data of the system, analyzes the operation data to obtain the current operating status of the system, and determines according to the current operating status whether the system requires automated operation and maintenance specifically includes: the operation monitoring module collects historical fault data and historical operating status data of the system; the operation monitoring module trains on the historical fault data and the historical operating status data to establish an alarm prediction model; the operation monitoring module inputs the operation data into the alarm prediction model to obtain a predicted value of the system's current operating status; the operation monitoring module compares the predicted value with a preset alarm threshold and, when the predicted value is greater than the preset alarm threshold, determines that the system requires automated operation and maintenance; when the predicted value is less than or equal to the preset alarm threshold, the operation monitoring module determines that the system does not require automated operation and maintenance.

It should be understood that the above is only an example and does not limit the technical solution of the present invention in any way. In specific applications, those skilled in the art can make settings as needed, and the present invention places no limitation on this.

It should be noted that the workflow described above is only illustrative and does not limit the scope of protection of the present invention. In practical applications, those skilled in the art can select some or all of it according to actual needs to achieve the purpose of the solution of this embodiment, and no limitation is imposed here.

In addition, for technical details not described exhaustively in this embodiment, refer to the automated operation and maintenance method provided by any embodiment of the present invention; they are not repeated here.

Furthermore, it should be noted that, as used herein, the terms "comprise", "include", or any other variant thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or system. Without further limitation, an element defined by the phrase "comprises a..." does not exclude the presence of additional identical elements in the process, method, article, or system that includes that element.

The serial numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.

From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software together with the necessary general-purpose hardware platform, or by hardware alone, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as a read-only memory (ROM)/RAM, magnetic disk, or optical disc) and includes several instructions that cause a terminal device (which may be a mobile phone, computer, server, network device, or the like) to execute the methods described in the embodiments of the present invention.

The above are only preferred embodiments of the present invention and do not thereby limit the patent scope of the present invention. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included in the scope of patent protection of the present invention.

Claims

1. An automated operation and maintenance platform, characterized in that the automated operation and maintenance platform comprises: an operation monitoring module, an operation and maintenance module, and a script issuing module;

the operation monitoring module is used for collecting operation data of the system, analyzing the operation data to obtain the current operation state of the system, and determining whether the system needs to be subjected to automated operation and maintenance according to the current operation state;

the operation and maintenance module is used for matching a corresponding fault processing template from a preset operation and maintenance knowledge base according to the current operation state when the system needs to be automatically operated and maintained, and sending the fault processing template to the script issuing module;

and the script issuing module is used for configuring the corresponding operation and maintenance script according to the fault processing template and issuing and loading the operation and maintenance script.

2. The automated operation and maintenance platform of claim 1, wherein the script issuing module comprises: an automatic script writing and issuing sub-module;

the automatic script writing and issuing sub-module is used for inquiring an operation and maintenance script matched with the fault type in a preset operation and maintenance script library according to the fault processing template, and issuing and loading the operation and maintenance script when inquiring the operation and maintenance script matched with the fault type;

and the automatic script writing and issuing sub-module is also used for automatically configuring and generating an operation and maintenance script according to the fault processing template when no operation and maintenance script matched with the fault type is found, and auditing, issuing and loading the generated operation and maintenance script.

3. The automated operation and maintenance platform according to claim 2, wherein the automatic script writing and issuing sub-module is further configured to determine a fault handling method according to the fault processing template, and determine a corresponding fault processing code structure according to the fault handling method;

the automatic script writing and issuing sub-module is further used for determining corresponding code parameters according to the fault processing code structure, filling the code parameters into the fault processing code structure and generating an operation and maintenance script according to a preset script generation rule.

4. The automated operation and maintenance platform of claim 1, wherein the script issuing module further comprises: a script maintenance sub-module and a script update management sub-module;

the script maintenance submodule is used for monitoring current state information of the system after the current operation and maintenance script is executed in a preset period and detecting whether the current state information is consistent with a preset system state or not;

the script maintenance sub-module is further configured to generate update information and send it to the script update management sub-module when the current state information is inconsistent with the preset system state;

the script updating management sub-module is used for obtaining a state difference value according to the current state information and the preset system state based on the updating information, and updating the current operation and maintenance script according to the state difference value.

5. The automated operation and maintenance platform according to claim 4, wherein the script maintenance sub-module is further configured to obtain an operation and maintenance script running state monitoring model, and obtain fault information according to the current operation state;

the script maintenance sub-module is further configured to input the fault information and the current operation and maintenance script into the operation and maintenance script running state monitoring model, so as to obtain a preset system state after the current operation and maintenance script is executed.

6. The automated operation and maintenance platform according to claim 5, wherein the script maintenance sub-module is further configured to obtain historical operation and maintenance data, and perform preprocessing on the historical operation and maintenance data to obtain training data corresponding to the historical operation and maintenance data;

and the script maintenance sub-module is also used for training according to the training data to obtain an operation and maintenance script running state monitoring model.

7. The automated operation and maintenance platform according to claim 1, wherein the operation and maintenance module is further configured to obtain system historical operation and maintenance data, and perform preprocessing on the system historical operation and maintenance data to obtain an operation and maintenance data text set;

the operation and maintenance module is further used for carrying out data association rule mining on the operation and maintenance data text set and converting historical operation and maintenance data in the operation and maintenance data text set into operation and maintenance knowledge data;

the operation and maintenance module is also used for establishing a preset operation and maintenance knowledge base according to the operation and maintenance knowledge data.

8. The automated operation and maintenance platform according to claim 7, wherein the operation and maintenance module is further configured to build a search engine in the preset operation and maintenance knowledge base and store operation and maintenance knowledge data in the preset operation and maintenance knowledge base;

the operation and maintenance module is further used for obtaining alarm information according to the operation state and determining a fault type corresponding to the alarm information through semantic analysis;

the operation and maintenance module is further configured to match a fault handling template corresponding to the fault type in the preset operation and maintenance knowledge base according to the fault type, and send the fault handling template to the script issuing module.

9. The automated operation and maintenance platform of any one of claims 1 to 8, wherein the operation monitoring module is further configured to collect historical fault data and historical operating status data of the system;

the operation monitoring module is also used for training according to the historical fault data and the historical operation state data, and establishing an alarm prediction model;

the operation monitoring module is also used for inputting the operation data into the alarm prediction model to obtain a current operation state predicted value of the system;

the operation monitoring module is further used for comparing the current operation state predicted value with a preset alarm threshold value and determining that the system needs to perform automatic operation and maintenance when the current operation state predicted value is greater than the preset alarm threshold value;

the operation monitoring module is further configured to determine that the system does not need to perform automated operation and maintenance when the predicted value of the current operation state is less than or equal to the preset alarm threshold.

10. An automated operation and maintenance method, wherein the automated operation and maintenance method is applied to the automated operation and maintenance platform according to any one of claims 1 to 9, and the automated operation and maintenance method comprises:

the operation monitoring module collects operation data of the system, analyzes the operation data to obtain the current operation state of the system, and determines whether the system needs to be automatically operated and maintained according to the current operation state;

when the system needs to be automatically operated and maintained, the operation and maintenance module matches a corresponding fault processing template from a preset operation and maintenance knowledge base according to the current operation state and sends the fault processing template to the script issuing module;

and the script issuing module configures a corresponding operation and maintenance script according to the fault processing template, and issues and loads the operation and maintenance script.
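
The flow recited in claims 1 to 3, 9 and 10 can be illustrated with a minimal Python sketch. It is not the patented implementation. The names `FaultTemplate`, `needs_maintenance`, `match_template`, `configure_script` and `run_cycle` are hypothetical, the knowledge base and script library are assumed to be plain dictionaries keyed by fault type, the fault type is assumed to have been determined already, and the auditing, issuing and loading steps are omitted.

```python
from dataclasses import dataclass, field
from string import Template
from typing import Dict, Optional


@dataclass
class FaultTemplate:
    """Hypothetical fault processing template stored in the knowledge base."""
    fault_type: str
    code_structure: str  # script skeleton containing $placeholders
    parameters: Dict[str, str] = field(default_factory=dict)


def needs_maintenance(predicted_state: float, alarm_threshold: float) -> bool:
    # Claim 9: maintenance is needed only when the predicted value is
    # strictly greater than the preset alarm threshold.
    return predicted_state > alarm_threshold


def match_template(fault_type: str,
                   knowledge_base: Dict[str, FaultTemplate]) -> Optional[FaultTemplate]:
    # Claims 1 and 8: match the fault processing template for the fault type.
    # A dictionary lookup stands in for the knowledge-base search (assumption).
    return knowledge_base.get(fault_type)


def configure_script(template: FaultTemplate,
                     script_library: Dict[str, str]) -> str:
    # Claim 2: reuse a script from the library when one matches the fault type.
    existing = script_library.get(template.fault_type)
    if existing is not None:
        return existing
    # Claim 3: otherwise fill the code parameters into the code structure.
    # Auditing, issuing and loading are not modeled in this sketch.
    return Template(template.code_structure).substitute(template.parameters)


def run_cycle(predicted_state: float,
              alarm_threshold: float,
              fault_type: str,
              knowledge_base: Dict[str, FaultTemplate],
              script_library: Dict[str, str]) -> Optional[str]:
    """Return the script to issue, or None when no action is taken."""
    if not needs_maintenance(predicted_state, alarm_threshold):
        return None
    template = match_template(fault_type, knowledge_base)
    if template is None:
        return None
    return configure_script(template, script_library)


if __name__ == "__main__":
    # Illustrative data only; the fault type and command are made up.
    kb = {
        "disk_full": FaultTemplate(
            fault_type="disk_full",
            code_structure="find $path -mtime +$days -delete",
            parameters={"path": "/var/log/app", "days": "7"},
        )
    }
    print(run_cycle(0.9, 0.8, "disk_full", kb, {}))
```

In this sketch a predicted value equal to the threshold triggers no action, matching the "less than or equal to" branch of claim 9, and a script already in the library takes precedence over generating a new one from the template.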