WO2021102617A1

WO2021102617A1 - Multi-public cloud computing platform-oriented cluster monitoring system and monitoring method therefor

Info

Publication number: WO2021102617A1
Application number: PCT/CN2019/120527
Authority: WO
Inventors: 朱和胜; 林帅康; 刘阳; 马健; 温书豪
Original assignee: Shenzhen Jingtai Technology Co Ltd
Current assignee: Shenzhen Jingtai Technology Co Ltd
Priority date: 2019-11-25
Filing date: 2019-11-25
Publication date: 2021-06-03
Anticipated expiration: 2022-05-25

Abstract

The present invention provides a multi-cloud computing platform-oriented cluster monitoring system, comprising: a data acquisition subsystem, configured to acquire basic resource data, task running state and consumption, and overall resource usage monitoring data of each cloud computing platform according to given indicators, and to provide an interface that a scheduling system calls to obtain real-time monitoring data to guide scheduling; a data processing subsystem, configured to process the cluster monitoring data uploaded by each data acquisition subsystem by means of RPC requests, transfer it to a back end for storage, and provide an interface for data display and for a billing system; and an alarm subsystem, configured to process and analyze the monitoring data according to an alarm strategy, determine an alarm level, and send alarm information. The present invention makes it easier to view overall task execution, improves resource utilization, persists computing resource data to facilitate bill auditing and reconciliation, resolves some anomalies automatically, and reduces manual intervention.

Description

Multi-public cloud computing platform-oriented cluster monitoring system and monitoring method thereof

Technical field

The invention belongs to the technical field of computing scheduling across multiple public clouds, and specifically relates to a cluster monitoring system for multiple public cloud computing platforms and a monitoring method thereof. The system can be used in clusters spanning multiple cloud computing platforms, both as a monitoring and alarm system for the resource state of multiple computing clusters and the state of computing tasks, and as a data support system for other related systems.

Background technique

Cloud computing is a pay-per-use model that provides available, convenient, on-demand network access to a shared pool of configurable computing resources (including networks, servers, storage, application software, and services). These resources can be provisioned rapidly with minimal management effort or interaction with the service provider.

With the development of technologies such as the Internet and cloud computing, more and more public cloud computing resources are available to choose from. For reasons of computing cost and regionalization, scheduling computing tasks across multiple public clouds has become a trend. At the same time, the complexity of cloud computing itself is increasing. Given cost and complexity constraints, public cloud providers cannot provide complete monitoring indicators and monitoring data for every computing resource, so users can only make decisions based on limited monitoring data.

Each cloud computing provider offers visual resource monitoring that gives users a rough picture of resource usage over a given period and lets them set thresholds on resource data to trigger alarm actions. An effective monitoring system makes it possible to manage the cloud computing platform in an agile way, and thereby ensures the availability and security of the entire scheduling and computing process while the cloud platform and the user's computing platform keep iterating.

The problems with current public cloud monitoring systems lie mainly in the following aspects:

1. Cloud computing providers only provide basic monitoring indicators and the corresponding monitoring data for computing resources. The computing platform mainly runs large-scale, compute-intensive tasks that consume large amounts of CPU resources. Cloud providers only offer basic resource monitoring of computing nodes, such as CPU, memory, and network, and this basic monitoring data is not sufficient to meet the needs of the computing platform. Computing platforms currently on the market are based on Kubernetes or Mesos, and need to monitor indicators such as the real-time running status of every computing task on the platform and each task's requested and actually used resources. Cloud providers currently cannot support these monitoring indicators.

2. For cost reasons, cloud providers do not support the collection of custom monitoring indicators or more detailed analysis of monitoring data, so the monitoring results cannot be fed back to the scheduling system to adjust its strategy in real time. As explained in item 1, cloud providers cannot provide some of the monitoring indicators that a computing platform needs, nor do they let users collect and display these indicators by other means. Existing monitoring data is stored in the cloud, with no way to retrieve historical data for a given period for more detailed analysis; resource usage therefore cannot be verified, and resource usage bills cannot be checked. If the data on which the scheduling system relies to adjust its strategy cannot be obtained, real-time scheduling strategies are out of the question, and a single fixed scheduling strategy may waste resources unnecessarily.

3. The built-in resource monitoring views are scattered and cannot provide a unified, user-customizable view. Moreover, each cloud provider monitors in its own way, so it is hard to process the resource monitoring data of all clouds in a unified way. Unlike a traditional operations monitoring system, a computing platform cares more about the overall running state of a cluster than about the basic resource state of an individual machine, for example the cluster's overall resource allocation rate, resource request rate, and the rate at which computing tasks are consumed. Cloud providers not only cannot collect certain indicators, they also struggle to consolidate even basic resources, and so cannot offer an intuitive view of resource status.

4. Alarms can only be configured separately for a single resource or a single type of resource, and cannot be graded. Cloud providers offer threshold alarms for basic resources but cannot grade them, so when there are too many alarms, critical ones are easily overlooked and not handled in time. In addition, the existing alarm strategies fall well short of the alarm requirements of a cloud computing platform, so the runtime state of the computing platform cannot be fully understood.

Summary of the invention

Technical problem

Solution to the problem

Technical solution

In view of the above technical problems, the present invention provides a cluster monitoring system for multiple public cloud computing platforms and a monitoring method thereof. The system acquires monitoring data from computing clusters on multiple public clouds; supports collection of the monitoring indicators on which the computing platform depends, as well as analysis and dumping of the monitoring data and feedback to the scheduling system and the reconciliation system; supports custom alarm grading and different alarm methods; and supports a resource monitoring view that aggregates the monitoring data of all clouds.

The specific technical solution is as follows:

The cluster monitoring system for multi-cloud computing platforms comprises three subsystems:

A data collection subsystem, responsible for collecting monitoring data such as basic resource data, task running status and consumption, and overall resource usage status from the computing platform on each cloud according to established indicators, and for providing an interface that the scheduling system calls to obtain real-time monitoring data to guide scheduling;

A data processing subsystem, responsible for receiving the cluster monitoring data that each data collection subsystem uploads via RPC requests, performing a series of processing steps, dumping the results to the back end for storage, and providing interfaces for data display, the billing system, and other functions;

An alarm subsystem, responsible for processing and analyzing the monitoring data according to alarm strategies, determining the alarm level, and sending alarm information. For low-level alarms, the subsystem can handle and recover from them on its own using preset methods.

The data collection subsystem is divided into three modules according to the type of data to be collected:

Cluster node information collection module. Because the scheduling system distributes computing tasks unpredictably, the number of tasks must be weighed against whether there are enough machines to schedule them, so observing the node count of each cluster in real time is essential. This module collects, for the computing clusters on each cloud computing provider, the actual number of nodes, the planned number of nodes, and the maximum supported number of nodes. The module implements a unified application interface layer that connects to the basic monitoring interfaces of the various cloud providers;

Cluster computing resource information collection module. The clusters run compute-intensive tasks, and raising CPU utilization can greatly reduce computing costs. This module collects the CPU information of each computing cluster: the total number of CPUs, the number of CPUs requested by tasks, and the number of CPUs actually used by tasks. Third-party plug-ins (Heapster, Metrics-Server, Prometheus) are deployed in the cluster, and the module wraps them behind a unified resource interface, so that the number of CPUs requested by tasks and the number actually used can be obtained in real time. For the total number of CPUs, a separate acquisition interface is implemented for each cloud;

Computing task status collection module. This module collects information on all tasks in the cluster in real time and classifies the tasks through a classification sub-module (for example Running, Pending, Evicted, ImagePull, PodInitializing, and so on). This data provides strong data support for the alarm system.
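The classification sub-module can be pictured with a minimal Python sketch. The record layout (`name`, `status`), the function name `classify_tasks`, and the `Unknown` fallback bucket are assumptions made for illustration; only the status names come from the text.

```python
from collections import Counter, defaultdict

# Status names listed in the description; "Unknown" is an assumed fallback bucket.
KNOWN_STATUSES = {"Running", "Pending", "Evicted", "ImagePull", "PodInitializing"}

def classify_tasks(tasks):
    """Group task names by status and count the tasks in each group."""
    groups = defaultdict(list)
    for task in tasks:
        status = task.get("status")
        if status not in KNOWN_STATUSES:
            status = "Unknown"
        groups[status].append(task["name"])
    counts = Counter({status: len(names) for status, names in groups.items()})
    return dict(groups), counts

# Hypothetical task records as they might arrive from the cluster's task API.
tasks = [
    {"name": "job-a", "status": "Running"},
    {"name": "job-b", "status": "Pending"},
    {"name": "job-c", "status": "Pending"},
    {"name": "job-d", "status": "CrashLoop"},
]
groups, counts = classify_tasks(tasks)
```

Keeping both the grouped names and the per-status counts lets an alarm rule work either on totals (for example, too many Pending tasks) or on individual tasks.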

In addition, the data collection subsystem provides a cluster state definition function. It analyzes the various states that arise during cluster computation (for example scaling up, scaling down, stable, and full load). These states reflect the cluster's current runtime situation more intuitively, and are used to produce monitoring information that guides scheduling according to the needs of the scheduling system.
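The text names the states but not the rules that separate them. The following Python sketch shows one possible rule set; the comparisons between actual, planned, and maximum node counts and the `full_load_ratio` threshold are assumptions, not the patent's definition.

```python
from enum import Enum

class ClusterState(Enum):
    SCALING_UP = "scaling_up"
    SCALING_DOWN = "scaling_down"
    FULL_LOAD = "full_load"
    STABLE = "stable"

def define_cluster_state(actual_nodes, planned_nodes, max_nodes,
                         cpu_requested, cpu_total, full_load_ratio=0.95):
    """Derive a coarse cluster state from node counts and CPU requests (assumed rules)."""
    # Full load: the cluster cannot grow further and nearly all CPUs are requested.
    if (actual_nodes >= max_nodes and cpu_total > 0
            and cpu_requested / cpu_total >= full_load_ratio):
        return ClusterState.FULL_LOAD
    # Fewer nodes than planned: nodes are still being added.
    if actual_nodes < planned_nodes:
        return ClusterState.SCALING_UP
    # More nodes than planned: nodes are being removed.
    if actual_nodes > planned_nodes:
        return ClusterState.SCALING_DOWN
    return ClusterState.STABLE
```

For instance, a cluster with 5 actual nodes and 8 planned nodes would be reported as scaling up under these rules, which a scheduler could read as "capacity is on the way".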

The data processing subsystem is divided into three modules according to function:

Monitoring information aggregation module, responsible for processing the monitoring data uploaded by the data collection subsystems. It processes and analyzes the cluster information and task running information from each cloud, aggregates it at different levels, and then dumps it for display and auditing;

Billing information processing module, which converts the data into the form required for bill analysis, according to billing requirements, and stores it in a time-series database every minute;

Task transfer module, responsible for automatically transferring tasks to lightly loaded clusters based on the monitoring information of each cluster and the task information of the scheduling system, thereby shortening task queuing time and improving the utilization of computing resources.
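As an illustration of aggregation at different levels, the sketch below sums per-cluster records along a chosen dimension such as the cloud provider. The field names (`provider`, `nodes`, `cpu_total`, `cpu_used`) and the sample values are hypothetical.

```python
from collections import defaultdict

def aggregate_by(records, dimension):
    """Sum node and CPU figures over all records that share the same dimension value."""
    totals = defaultdict(lambda: {"nodes": 0, "cpu_total": 0, "cpu_used": 0.0})
    for rec in records:
        bucket = totals[rec[dimension]]
        bucket["nodes"] += rec["nodes"]
        bucket["cpu_total"] += rec["cpu_total"]
        bucket["cpu_used"] += rec["cpu_used"]
    return {key: dict(value) for key, value in totals.items()}

# Hypothetical per-cluster monitoring records.
records = [
    {"provider": "cloud-a", "cluster": "a-1", "nodes": 10, "cpu_total": 160, "cpu_used": 120.0},
    {"provider": "cloud-a", "cluster": "a-2", "nodes": 5, "cpu_total": 80, "cpu_used": 20.0},
    {"provider": "cloud-b", "cluster": "b-1", "nodes": 8, "cpu_total": 128, "cpu_used": 64.0},
]
summary = aggregate_by(records, "provider")
```

Passing a different key (for example `"cluster"`) reuses the same function for a finer-grained view.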

The alarm subsystem is divided into three modules according to alarm type:

Alarm strategy processing module, responsible for implementing the data processing logic of the alarm strategies; different alarm strategies have different processing logic;

Cluster resource utilization alarm module, responsible for executing the alarm strategy logic related to cluster information, classifying the results, and then selecting different channels to send alarm information according to alarm severity;

Computing task running status alarm module, responsible for executing the alarm strategies related to computing tasks. Because task states are diverse, this module defines a priority for each state check, runs the checks in priority order, and sends alarm information for tasks in abnormal states. For some low-level anomalies the module also defines corresponding solutions, which it executes at the same time as it sends the alarm information.
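Priority-ordered state checks with preset remedies might look like the following Python sketch. The check table, the priority numbers, the level names, and the "resubmit" remedy are all hypothetical; the text only states that checks run in priority order and that some low-level anomalies have preset solutions.

```python
# Hypothetical check table: (priority, status, level, remedy). Lower priority runs first.
CHECKS = [
    (2, "Pending", "medium", None),
    (1, "ImagePull", "high", None),
    (3, "Evicted", "low", lambda task: f"resubmitted {task}"),
]

def run_checks(tasks_by_status, checks=CHECKS):
    """Check abnormal statuses in priority order and build one alarm event per task."""
    events = []
    for _priority, status, level, remedy in sorted(checks, key=lambda c: c[0]):
        for task in tasks_by_status.get(status, []):
            # Only low-level anomalies with a preset remedy are handled automatically.
            action = remedy(task) if level == "low" and remedy is not None else None
            events.append({"task": task, "status": status, "level": level, "action": action})
    return events

events = run_checks({"Evicted": ["job-7"], "ImagePull": ["job-3"], "Running": ["job-1"]})
```

Statuses without an entry in the table (here `Running`) produce no event, so healthy tasks never reach the alarm channels.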

In the monitoring method of the cluster monitoring system for multiple public cloud computing platforms, the three subsystems work in coordination. The steps of each subsystem are described below:

The data collection subsystem runs in a distributed manner on the cloud providers' computing clusters. The specific steps are as follows:

(1) Through unified interface layer functions, cluster node information, cluster resource information, and computing task status information are collected simultaneously, as described below:

(1.1) Cluster node information collection module: first, the module's unified interface layer connects to the cloud provider; next, the unified interface layer is used to obtain the cluster node information, namely the actual number of nodes, the planned number of nodes, and the maximum supported number of nodes; finally, the result is stored temporarily in memory.

(1.2) Cluster computing resource information collection module: first, the module's unified interface layer likewise wraps the cluster resource collection plug-ins, supporting plug-ins such as Heapster, Metrics-Server, and Prometheus; second, the plug-in supported by the cluster is identified (different cloud providers support different plug-ins); finally, the total number of CPUs, the number of CPUs requested by tasks, and the number of CPUs actually used by tasks are obtained through the unified interface layer and stored temporarily in memory.

(1.3) Computing task status information collection module: first, information on all tasks in the current cluster is obtained through the native interface layer of the cluster computing engine; second, a state classifier classifies all states and extracts key information; finally, the result is stored temporarily in memory.

(2) When steps (1.1), (1.2), and (1.3) have completed a round of collection, the cluster state definition function loads the monitoring data from memory and analyzes it to determine the current state of the cluster.

(3) The latest monitoring data is extracted and uploaded to the data processing subsystem via an RPC request.

(4) The monitoring data is condensed and processed according to the requirements of the scheduling system, and the latest result is cached for the scheduling system to call.
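One collection round as described in steps (1) to (4) can be sketched in Python with a thread pool. The three collector functions return fixed stand-in values and are hypothetical; in a real deployment they would call the unified interface layer. The RPC upload of step (3) is omitted, and the fields exposed to the scheduler are an assumption.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in collectors with fixed sample values (hypothetical).
def collect_node_info():
    return {"actual": 8, "planned": 10, "max": 20}

def collect_cpu_info():
    return {"total": 128, "requested": 96, "used": 80.5}

def collect_task_status():
    return {"Running": 40, "Pending": 3}

COLLECTORS = {"nodes": collect_node_info, "cpu": collect_cpu_info, "tasks": collect_task_status}

def collect_round(collectors=COLLECTORS):
    """Run all collectors concurrently and keep their results in an in-memory snapshot."""
    with ThreadPoolExecutor(max_workers=len(collectors)) as pool:
        futures = {name: pool.submit(fn) for name, fn in collectors.items()}
        return {name: future.result() for name, future in futures.items()}

def for_scheduler(snapshot):
    """Condense a snapshot into the few figures a scheduler might need (assumed fields)."""
    cpu = snapshot["cpu"]
    return {
        "free_cpu": cpu["total"] - cpu["requested"],
        "pending_tasks": snapshot["tasks"].get("Pending", 0),
    }

snapshot = collect_round()
scheduler_view = for_scheduler(snapshot)
```

Caching `scheduler_view` after each round means a scheduler call never has to wait for a fresh collection.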

The data processing subsystem receives the monitoring data and processes it accordingly. The specific steps are as follows:

(1) Receive the monitoring data pushed by each cluster;

(2) Analyze the monitoring information along different dimensions, for example aggregating computing node information and CPU usage information by cloud provider, and persist the processed data to the database;

(3) Convert the data into a form the billing system can recognize, according to the billing system's requirements, and store it in the designated time-series database;

(4) Pull real-time task data from the scheduling system, analyze it together with the monitoring data of each cluster, and select the transferable tasks and the number to transfer, then carry out the task transfer.
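Step (4) does not specify how transferable tasks are chosen. The Python sketch below uses a simple greedy rule as a stand-in: pending tasks are moved to whichever other cluster currently has the most unrequested CPU. The field names, the assumption that each task needs `cpu_per_task` CPUs, and the assumption that pending tasks are waiting because their own cluster is full are all illustrative.

```python
def plan_transfers(clusters, cpu_per_task=1):
    """Greedy plan: move pending tasks to the clusters with the most free CPU."""
    free = {name: c["cpu_total"] - c["cpu_requested"] for name, c in clusters.items()}
    plan = []
    for source, info in clusters.items():
        remaining = info["pending"]
        # Visit candidate targets from most to least free CPU.
        for target in sorted(free, key=free.get, reverse=True):
            if remaining == 0:
                break
            if target == source:
                continue
            movable = min(remaining, free[target] // cpu_per_task)
            if movable > 0:
                plan.append((source, target, movable))
                free[target] -= movable * cpu_per_task
                remaining -= movable
    return plan

# Hypothetical cluster figures: cluster "a" is full and has 12 tasks waiting.
clusters = {
    "a": {"cpu_total": 100, "cpu_requested": 100, "pending": 12},
    "b": {"cpu_total": 100, "cpu_requested": 97, "pending": 0},
    "c": {"cpu_total": 50, "cpu_requested": 40, "pending": 0},
}
plan = plan_transfers(clusters)
```

Each tuple in `plan` is `(source, target, task_count)`; the actual migration would be carried out through the scheduling system's own interface.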

The alarm subsystem processes the monitoring data according to the alarm strategy logic. The specific steps are as follows:

(1) Execute the alarm strategy logic related to cluster information on the monitoring data, and perform alarm actions according to the anomaly level.

(2) Execute the alarm strategies related to computing tasks on the monitoring data, and perform alarm actions according to the anomaly level. For low-level anomalies that have a preset solution, execute the corresponding solution.
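Grading a cluster-level alarm and choosing channels by level could be sketched as follows. The level names, the thresholds (`saturation`, `min_efficiency`), and the channel names are hypothetical; the text only requires that anomalies be graded and that the channel depend on severity.

```python
# Hypothetical mapping from alarm level to delivery channels.
LEVEL_CHANNELS = {"critical": ["phone", "im"], "warning": ["im"], "info": ["email"]}

def classify_cpu_alarm(total, requested, used, saturation=0.95, min_efficiency=0.5):
    """Return an alarm level for a cluster's CPU figures, or None if nothing is wrong."""
    if total <= 0:
        return "critical"  # the cluster reports no CPU at all
    if requested / total >= saturation:
        return "critical"  # almost every CPU is already requested
    if requested > 0 and used / requested < min_efficiency:
        return "warning"   # tasks use far less CPU than they requested
    return None

def dispatch(level, message, channels=LEVEL_CHANNELS):
    """Format one alarm line per channel configured for the given level."""
    return [f"[{level.upper()}] via {ch}: {message}" for ch in channels.get(level, [])]

level = classify_cpu_alarm(total=100, requested=50, used=10)
lines = dispatch(level, "cluster a-1 CPU efficiency is low") if level else []
```

Because the channel table is plain data, adding a level or rerouting one does not touch the classification logic.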

Beneficial effects of the invention

The cluster monitoring system for multiple public cloud computing platforms and the monitoring method thereof provided by the present invention have the following technical effects:

(1) By integrating the computing platform interfaces of the major cloud providers and applying further processing to eliminate the differences between clouds, the system provides monitoring resource data in a standard format and speeds up the deployment of new clusters. Support interfaces for the cluster's various resource collection plug-ins keep the monitoring data format uniform while preserving plug-in diversity. The monitoring classification function displays the currently running cluster tasks in the front end according to their progress, so that users can better observe overall task execution.

(2) A series of monitoring data analysis modules analyze and process the cluster monitoring data and feed the results back to the scheduling system so that it can adjust its scheduling strategy and improve resource utilization. The system also analyzes and learns from scheduling data and monitoring data to adjust task distribution, estimate the completion time of batch tasks, transfer tasks automatically, and shorten task waiting time.

(3) A centralized cluster monitoring view shows the runtime status of each cluster (node information, CPU information, task information, and so on); an aggregated resource view shows overall resource information at the cloud provider level; and persisted computing resource data facilitates bill auditing and reconciliation.

(4) The resource utilization, task status, and other indicators of each cluster are monitored dynamically, and formatted alarm information is sent over different channels according to the severity of the anomaly. Some preset solutions resolve part of the anomalies automatically, reducing manual intervention.

Brief description of the drawings

Figure 1 is the overall architecture diagram of the monitoring system of the present invention;

Figure 2 is the system structure diagram of the data collection subsystem of the present invention;

Figure 3 is the system structure diagram of the data processing subsystem of the present invention;

Figure 4 is the system structure diagram of the alarm subsystem of the present invention;

Figure 5 is the implementation flow chart of the entire system of the present invention.

Embodiments of the invention

The technical solutions of the present invention are described in further detail below with reference to the accompanying drawings and embodiments.

Figure 1 shows the overall architecture of the system and illustrates the relationships between the subsystems and the flow of data:

The data collection subsystem runs on the clusters of each cloud and uploads the collected monitoring data to the remote data processing subsystem. After a series of analyses, the data is dumped and sent to the alarm subsystem, which processes the monitoring data according to the alarm strategies and then performs alarm actions.

Figure 2 is the architecture diagram of the data collection subsystem; how this subsystem works is explained with reference to it.

First, support is implemented for the cloud provider's computing cluster, the resource plug-ins, and the scheduling engine API. Second, each module calls its unified interface layer functions to collect the basic monitoring data; for task information in particular, a task classifier is needed to subdivide the task information. Then the monitoring data from this round is aggregated and standardized, and the scheduling-related data is processed separately and held until the scheduling system (S12) calls for it to guide its scheduling strategy. Finally, the monitoring data from this round is uploaded to the data processing subsystem (S11).
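The unified interface over resource plug-ins can be illustrated with a small adapter pattern in Python. The class and function names, the preference order, and the fixed values are hypothetical; a real adapter would query Heapster, Metrics-Server, or Prometheus rather than return constants.

```python
from abc import ABC, abstractmethod

class CpuMetricsSource(ABC):
    """Unified interface that every resource plug-in adapter implements (assumed shape)."""
    name: str

    @abstractmethod
    def requested_cpus(self) -> float: ...

    @abstractmethod
    def used_cpus(self) -> float: ...

class StaticSource(CpuMetricsSource):
    """Stand-in adapter returning fixed values instead of querying a real plug-in."""
    def __init__(self, name, requested, used):
        self.name, self._requested, self._used = name, requested, used

    def requested_cpus(self):
        return self._requested

    def used_cpus(self):
        return self._used

def select_source(available, preference=("Prometheus", "Metrics-Server", "Heapster")):
    """Pick the first plug-in in the (assumed) preference order that the cluster offers."""
    by_name = {source.name: source for source in available}
    for name in preference:
        if name in by_name:
            return by_name[name]
    raise LookupError("no supported resource plug-in found in this cluster")

source = select_source([StaticSource("Heapster", 4.0, 2.5), StaticSource("Metrics-Server", 8.0, 6.0)])
```

Callers only see `requested_cpus()` and `used_cpus()`, so supporting a new plug-in means adding one adapter class.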

Figure 3 is the architecture diagram of the data processing subsystem; how this subsystem works is explained with reference to it.

First, the subsystem receives the monitoring data uploaded by each cluster. Once every cluster has uploaded data once, or after a certain period of time, the subsystem preprocesses the data, checking whether it is valid, whether it has expired, and so on. If the data is valid, a copy is sent to the alarm subsystem (S21). Second, the aggregation module and the billing information processing module start processing the monitoring data at the same time. The aggregation module processes the monitoring information along different dimensions (such as cloud provider, computing engine type, and task type), stores the result in the database, and displays it in the front-end view. The billing information processing module filters the computing resource information (total CPUs, requested CPUs, number of nodes, and so on) according to the billing system's requirements, processes it by cloud provider, and stores it in the time-series database for the billing system to call (S23). Then the task transfer module pulls computing task information from the scheduling system and, combining it with the aggregated monitoring data, computes a task migration plan on the principle of minimizing task queuing time and maximizing resource utilization, for example moving tasks that have waited a long time in one group of clusters to other clusters with relatively idle resources. It then calls the migration interface of the scheduling system (S22) to complete the migration.
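The preprocessing check (valid and not expired) might be sketched in Python as below. The required fields and the 120-second freshness window are assumptions; the text does not state what makes a report invalid or after how long it expires.

```python
import time

# Assumed minimum set of fields a report must carry to be usable.
REQUIRED_FIELDS = ("cluster", "provider", "timestamp", "cpu_total")

def is_valid(report, now=None, max_age_seconds=120):
    """A report is valid if it has all required fields and is not older than the window."""
    now = time.time() if now is None else now
    if any(field not in report for field in REQUIRED_FIELDS):
        return False
    age = now - report["timestamp"]
    return 0 <= age <= max_age_seconds

def preprocess(reports, now=None):
    """Keep only valid, fresh reports for aggregation, billing, and alarms."""
    return [report for report in reports if is_valid(report, now=now)]

# Hypothetical uploads; `now` is fixed so the example is deterministic.
reports = [
    {"cluster": "a-1", "provider": "cloud-a", "timestamp": 990, "cpu_total": 160},
    {"cluster": "a-2", "provider": "cloud-a", "timestamp": 700, "cpu_total": 80},
    {"cluster": "b-1", "provider": "cloud-b", "timestamp": 995},
]
fresh = preprocess(reports, now=1000)
```

Here only the `a-1` report survives: `a-2` is too old and `b-1` lacks a required field.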

Figure 4 is the architecture diagram of the alarm subsystem; how this subsystem works is explained with reference to it.

First, the overall monitoring data of the clusters is obtained from the data collection subsystem, and the alarm strategy processing module processes the data according to the subsystem's strategy logic. The results are then sent to the respective alarm modules according to alarm type. After receiving the alarm data, the cluster resource utilization alarm module sends it out according to the level of the anomaly. After receiving the alarm data, the computing task running status alarm module sends the alarm data and checks whether a preset solution exists for the alarm; if so, it executes the solution and notifies the relevant personnel.

Figure 5 is a schematic diagram of the implementation of the entire system. The specific operation process is as follows:

Step 1: when a new cluster is connected, confirm whether its cloud provider is already supported; if not, support for the cloud provider must be implemented first;

Step 2: confirm which resource plug-in the cluster uses and whether it is already supported; if it is, deploy the collection subsystem directly; if not, implement support for the plug-in first and then deploy;

Step 3: collect the cluster monitoring data. The three types of data are collected in parallel, and the task information is classified by the classifier after it is collected;

Step 4: standardize the monitoring data to facilitate storage and analysis;

Step 5: analyze and cache the data required by the scheduling system, for the scheduling system to call;

Step 6: upload the monitoring data to the data processing subsystem; the data collection subsystem flow ends;

Step 7: the data processing subsystem preprocesses the monitoring data uploaded by each cluster and confirms its validity;

Step 8: process the billing-related data according to the billing system's requirements and store it in the time-series database for the billing system to use;

Step 9: aggregate the billing information along different dimensions and store it in the database;

Step 10: pull task information and combine it with the monitoring data to complete task analysis and transfer;

Step 11: the front end updates its display in real time; the data processing subsystem flow ends;

Step 12: the alarm processing module receives the monitoring data and begins processing it according to the alarm strategies; this step starts as soon as Step 7 has completed;

Step 13: the cluster resource utilization alarm module sends alarm information according to alarm level;

Step 14: the computing task running status alarm module sends alarm information according to alarm level;

Step 15: check whether a preset solution exists for the task status alarm; if so, execute the solution and notify the relevant personnel;

Step 16: the whole process ends.

Claims

The cluster monitoring system for multiple public cloud computing platforms is characterized by three subsystems:

The data collection subsystem is responsible for collecting basic resource data, task operation status and consumption, and overall resource usage monitoring data of each cloud computing platform according to the established indicators, and provides an interface for the scheduling system to call to obtain real-time monitoring data to guide scheduling;

The data processing subsystem is responsible for receiving the cluster monitoring data uploaded by each data collection subsystem through RPC requests, performing a series of processing steps on it, transferring it to the back end for storage, and providing interfaces for data display and for the billing system;

The alarm subsystem is responsible for processing and analyzing the monitoring data according to the alarm strategy, confirming the alarm level, and sending alarm information. For some low-level alarms, the alarm subsystem can handle the alarm and recover by itself according to a preset method.

The cluster monitoring system for multiple public cloud computing platforms according to claim 1, wherein the data collection subsystem includes three modules, which are:

The cluster node information collection module is responsible for collecting the actual number of nodes, the planned number of nodes, and the maximum number of supported nodes of the computing cluster nodes on each cloud computing provider;

The cluster node information collection module implements a unified application interface layer, which is used to interface with the basic monitoring interfaces of various cloud providers;

The cluster computing resource information collection module is responsible for collecting the CPU information of each computing cluster, including the total number of CPUs, the number of CPUs requested by tasks, and the number of CPUs actually used by tasks; the module deploys third-party plug-ins in the cluster and implements a unified resource interface wrapper for these plug-ins, so that the number of CPUs requested by tasks and the number of CPUs actually used by tasks can be obtained in real time; for the total number of CPUs, the acquisition interface is implemented according to the conditions of each cloud;

The computing task status collection module collects all task information in the cluster in real time, classifies tasks through a classification sub-module, and the task information provides data support for the alarm system.
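By way of example only, the unified application interface layer of the cluster node information collection module may be sketched as an abstract interface with one adapter per cloud provider. The class names `NodeInfo`, `NodeInfoProvider`, and `StaticProvider` are hypothetical; `StaticProvider` is a stand-in that returns fixed values where a real adapter would call the provider's own monitoring interface.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class NodeInfo:
    actual: int   # nodes currently in the cluster
    planned: int  # nodes the cluster is planned to have
    maximum: int  # maximum number of supported nodes


class NodeInfoProvider(ABC):
    """Unified interface: every cloud adapter exposes the same method."""

    @abstractmethod
    def node_info(self) -> NodeInfo:
        ...


class StaticProvider(NodeInfoProvider):
    """Stand-in adapter (assumption); returns fixed values for illustration."""

    def __init__(self, actual, planned, maximum):
        self._info = NodeInfo(actual, planned, maximum)

    def node_info(self) -> NodeInfo:
        return self._info


def collect_all(providers):
    """Query every registered adapter through the unified interface."""
    return {name: p.node_info() for name, p in providers.items()}


providers = {
    "cloud-a": StaticProvider(10, 12, 50),
    "cloud-b": StaticProvider(3, 3, 20),
}
snapshot = collect_all(providers)
```

With this arrangement, supporting a further cloud provider only requires a new adapter class; the collection code stays unchanged.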

The cluster monitoring system for multiple public cloud computing platforms according to claim 2, wherein the data collection subsystem also includes a cluster state definition function, by which the various states arising during cluster computing intuitively reflect the current runtime situation of the cluster, and monitoring information to guide scheduling is formed according to the requirements of the scheduling system.
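One possible form of the cluster state definition function is sketched below. The state names (`scaling_up`, `full`, `needs_scale_up`, `available`) and the decision rules are assumptions chosen for this example; the claims do not enumerate the states.

```python
def define_cluster_state(nodes_actual, nodes_planned, nodes_max,
                         cpu_total, cpu_requested):
    """Map raw monitoring values to a single state label (hypothetical labels)."""
    if nodes_actual < nodes_planned:
        return "scaling_up"          # planned nodes have not all joined yet
    if cpu_requested >= cpu_total:
        # All CPUs are claimed; whether more nodes can be added decides the state.
        return "full" if nodes_actual >= nodes_max else "needs_scale_up"
    return "available"


def scheduling_hint(name, nodes_actual, nodes_planned, nodes_max,
                    cpu_total, cpu_requested):
    """Condense the monitoring data into one record for the scheduling system."""
    return {
        "cluster": name,
        "state": define_cluster_state(nodes_actual, nodes_planned, nodes_max,
                                      cpu_total, cpu_requested),
        "cpu_free": max(cpu_total - cpu_requested, 0),
    }


hint = scheduling_hint("a1", 5, 5, 10, 40, 10)
```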

The cluster monitoring system for multiple public cloud computing platforms according to claim 1, wherein the data processing subsystem includes three modules, respectively:

The monitoring information aggregation module is responsible for processing the monitoring data uploaded by the data collection subsystem, performing a series of processing and analysis steps on the cluster information and task operation information of each cloud, aggregating the results at different levels, and then storing them for display and auditing;

The bill information processing module, according to the requirements of the bill, processes the data required for bill analysis and stores it in the time series database every minute;

The task transfer module is responsible for automatically transferring tasks to clusters with low load based on the monitoring information of each cluster and the task information of the scheduling system.
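The task transfer module may, for instance, be sketched as a greedy planner. The `low_load` threshold of 0.5, the input shapes, and the rule "send each task to the low-load cluster with the most free CPUs" are assumptions for illustration; the claims only require that tasks be moved to clusters with low load.

```python
def plan_transfers(clusters, pending, low_load=0.5):
    """Plan task transfers to low-load clusters.

    clusters: {name: (cpu_total, cpu_requested)}
    pending:  list of (task_id, cpus_needed)
    Returns a list of (task_id, target_cluster).
    """
    # Free CPUs of every cluster whose requested/total ratio is below the threshold.
    free = {
        name: total - requested
        for name, (total, requested) in clusters.items()
        if total > 0 and requested / total < low_load
    }
    if not free:
        return []

    plan = []
    for task_id, cpus in pending:
        target = max(free, key=free.get)  # cluster with the most free CPUs
        if free[target] >= cpus:
            plan.append((task_id, target))
            free[target] -= cpus
        # Tasks that fit nowhere stay where they are.
    return plan


clusters = {"a": (100, 90), "b": (100, 20), "c": (50, 10)}
pending = [("t1", 30), ("t2", 45), ("t3", 60)]
plan = plan_transfers(clusters, pending)
```

In this example cluster `a` is excluded as heavily loaded, `t1` and `t2` are assigned to `b`, and `t3` is left in place because no low-load cluster has 60 free CPUs.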

The cluster monitoring system for multiple public cloud computing platforms according to claim 1, characterized in that the alarm subsystem includes three modules, namely:

The alarm strategy processing module is responsible for implementing the data processing logic of the alarm strategy. Different alarm strategies have different processing logic;

The cluster resource usage alarm module is responsible for executing the alarm strategy logic related to the cluster information, classifying the strategy processing results, and then selecting different channels to send the alarm information according to the severity of the alarm;

The computing task running status alarm module is responsible for executing the alarm strategies related to computing tasks: priorities are defined for the various status checks, the checks are performed in priority order, and alarm information is sent for tasks in an abnormal state; in addition, corresponding solutions are defined for some low-level exceptions, and the solution is executed at the same time as the alarm information is sent.
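The priority-ordered status detection and the preset solutions for low-level exceptions may be illustrated as follows. The check names, thresholds, alarm levels, and the `resubmit` action are hypothetical; only the ordering by priority and the pairing of low-level exceptions with solutions come from the text.

```python
# Hypothetical checks, listed in priority order; the first match wins.
CHECKS = [
    ("failed", "high",
     lambda t: t["status"] == "failed"),
    ("stuck_pending", "low",
     lambda t: t["status"] == "pending" and t["wait_min"] > 30),
    ("slow", "low",
     lambda t: t["status"] == "running" and t["run_min"] > 600),
]

# Preset solutions exist only for some low-level exceptions (assumed example).
SOLUTIONS = {"stuck_pending": lambda t: f"resubmit {t['id']}"}


def inspect(task):
    """Run the checks in priority order; return an alarm dict or None."""
    for name, level, predicate in CHECKS:
        if predicate(task):
            alarm = {"task": task["id"], "abnormal": name, "level": level}
            solution = SOLUTIONS.get(name)
            if level == "low" and solution is not None:
                # The solution runs together with sending the alarm.
                alarm["action"] = solution(task)
            return alarm
    return None


alarm = inspect({"id": "t1", "status": "pending", "wait_min": 45, "run_min": 0})
```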

The monitoring method of a cluster monitoring system oriented to multiple public cloud computing platforms according to any one of claims 1 to 5, characterized in that it comprises the following steps:

The data collection subsystem is distributed and runs on the computing cluster of the cloud provider. The specific steps include:

(1) Through a unified interface layer function, the collection of cluster node information, cluster resource information, and computing task status information is performed synchronously;

(2) When a round of collection in step (1) is completed, the cluster state definition function loads the monitoring data held in memory to analyze and determine the current state of the cluster;

(3) Extract the latest monitoring data and upload it to the data processing subsystem through RPC request;

(4) Condense and process the monitoring data according to the requirements of the scheduling system, and store the latest results for the scheduling system to call;
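Steps (1) to (4) of the data collection subsystem may be sketched as one collection round. This example reads "synchronously" as "side by side within the same round" and therefore uses a thread pool, which is an interpretation rather than something the claim states. The `upload` callable stands in for the RPC request, and all names and data shapes are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def run_round(collectors, define_state, upload, scheduler_cache):
    """Execute one collection round and return the collected data."""
    # (1) Run the three collectors side by side and keep results in memory.
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn) for name, fn in collectors.items()}
        data = {name: future.result() for name, future in futures.items()}

    # (2) Determine the current cluster state from the in-memory data.
    data["state"] = define_state(data)

    # (3) Upload the latest monitoring data (stand-in for the RPC request).
    upload(data)

    # (4) Keep a condensed record for the scheduling system to call.
    resources = data["resources"]
    scheduler_cache["latest"] = {
        "state": data["state"],
        "cpu_free": resources["cpu_total"] - resources["cpu_requested"],
    }
    return data


# Stub collectors returning fixed values, for illustration only.
collectors = {
    "nodes": lambda: {"actual": 4},
    "resources": lambda: {"cpu_total": 32, "cpu_requested": 20},
    "tasks": lambda: [],
}
uploaded, cache = [], {}
run_round(collectors, lambda d: "available", uploaded.append, cache)
```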

The data processing subsystem receives the monitoring data information and performs corresponding processing. The specific steps are as follows:

(1) Receive monitoring data pushed by each cluster;

(2) Analyze the monitoring information according to different dimensions, and persist the processed data to the database;

(3) According to the requirements of the billing system, process the data into a form the billing system can recognize and store it in the designated time series database;

(4) Pull the real-time task data of the scheduling system, analyze the monitoring data of all clusters in a unified manner, select the transferable tasks and their number, and perform task transfer;

The alarm subsystem processes the monitored data information according to the alarm strategy logic and the specific steps are as follows:

(1) Execute alarm strategy logic related to cluster information according to monitoring data information, distinguish abnormal levels and execute alarm actions;

(2) Execute alarm strategies related to computing tasks based on monitoring data information, distinguish abnormal levels and perform alarm actions; for low-level abnormalities with preset solutions, execute the corresponding solutions for the abnormalities.
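For step (1) of the alarm subsystem, distinguishing abnormal levels and choosing an alarm action may look like the sketch below. The usage thresholds (0.95 and 0.85), the level names, and the channel names are assumptions; the disclosure only states that different channels are selected according to severity.

```python
# Hypothetical thresholds, checked from most to least severe.
LEVELS = [(0.95, "critical"), (0.85, "warning")]

# Hypothetical channels per level.
CHANNELS = {"critical": ["phone", "im"], "warning": ["im"]}


def usage_alarm(cluster, cpu_used, cpu_total):
    """Return an alarm for the cluster's CPU usage, or None if usage is normal."""
    usage = cpu_used / cpu_total
    for threshold, level in LEVELS:
        if usage >= threshold:
            return {
                "cluster": cluster,
                "level": level,
                "channels": CHANNELS[level],
            }
    return None


critical = usage_alarm("a1", 97, 100)
warning = usage_alarm("a1", 90, 100)
```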

The monitoring method for a cluster monitoring system oriented to multiple public cloud computing platforms according to claim 6, characterized in that the step (1) of the data collection subsystem specifically includes the following steps:

(1.1) Cluster node information collection module: first, the unified interface layer of this module completes access to the cloud provider; then the unified interface layer is used to obtain the cluster node information, that is, the actual number of nodes, the planned number of nodes, and the maximum number of supported nodes; finally, the information is temporarily saved in memory;

(1.2) Cluster computing resource information collection module: first, the unified interface layer of this module completes the packaging of the cluster resource collection plug-ins; second, the plug-ins supported by the cluster are confirmed; finally, the total number of CPUs, the number of CPUs requested by tasks, and the number of CPUs actually used by tasks are obtained through the unified interface layer and temporarily stored in memory;

(1.3) The computing task state information collection module first obtains all task information of the current cluster through the native interface layer of the cluster computing engine; secondly, it uses a state classifier to classify all states and extract key information; finally, it is temporarily stored in the memory.
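The state classifier of step (1.3) may be sketched as a lookup from engine-native states to a small set of monitoring categories. The native state names, the categories, and the choice of `id` and `state` as the "key information" are assumptions for this example; unknown states are treated as abnormal here so that they are not silently ignored.

```python
from collections import Counter

# Hypothetical mapping from engine-native states to monitoring categories.
CATEGORY = {
    "queued": "waiting",
    "starting": "waiting",
    "running": "running",
    "done": "finished",
    "error": "abnormal",
}


def categorize(state):
    """Unknown native states fall back to 'abnormal' (assumed policy)."""
    return CATEGORY.get(state, "abnormal")


def classify(tasks):
    """Return per-category counts and the key fields of abnormal tasks."""
    counts = Counter(categorize(t["state"]) for t in tasks)
    abnormal = [
        {"id": t["id"], "state": t["state"]}
        for t in tasks
        if categorize(t["state"]) == "abnormal"
    ]
    return dict(counts), abnormal


tasks = [
    {"id": "t1", "state": "queued"},
    {"id": "t2", "state": "running"},
    {"id": "t3", "state": "error"},
    {"id": "t4", "state": "evicted"},  # not in the mapping
]
counts, abnormal = classify(tasks)
```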