CN117793108A

CN117793108A - Large-scale cloud server cluster management method and device

Info

Publication number: CN117793108A
Application number: CN202311831021.3A
Authority: CN
Inventors: 张益兵; 孙利杰; 陈松政
Original assignee: Hunan Qilin Xin'an Technology Co ltd
Current assignee: Hunan Qilin Xin'an Technology Co ltd
Priority date: 2023-12-27
Filing date: 2023-12-27
Publication date: 2024-03-29

Abstract

Embodiments of the invention provide a large-scale cloud server cluster management method and device, relating to the technical field of server cluster management. The method comprises the following steps: a management node periodically sends heartbeat information, according to a preset period and in a first multicast domain communication mode, to the domain master nodes of the cluster domains; and each domain master node, upon receiving the heartbeat information, sends response information to the management node. The invention addresses the problem of cluster function failure caused by an abnormal management node, thereby achieving effective management of the cluster.

Description

Large-scale cloud server cluster management method and device

Technical Field

The embodiment of the invention relates to the technical field of cloud server clusters, in particular to a large-scale cloud server cluster management method and device.

Background

Currently, cloud server clusters in the industry generally adopt an architecture of one or more management nodes plus a plurality of computing nodes to implement the functions of the cluster. All computing nodes in the cluster are managed over the network, with the management node as the center.

Such a centralized cluster is limited by the processing capacity of the management node, so the cluster size it can support is often limited. First, when abnormalities occur on a large scale, a network storm easily arises, so that abnormal hosts or virtual machines cannot be handled for a long time. Second, when the management node itself becomes abnormal, the functions of the whole cluster fail.

No effective solution to the above problems has yet been proposed.

Disclosure of Invention

The embodiment of the invention provides a large-scale cloud server cluster management method and device, which are used for at least solving the problem of cluster function failure caused by abnormal management nodes in the related technology.

According to one embodiment of the present invention, there is provided a large-scale cloud server cluster management method including:

a management node sends heartbeat information to the domain master node in each cluster domain according to a preset period in a first multicast domain communication mode, wherein the domain master nodes are in one-to-one correspondence with the cluster domains, each domain master node is in communication connection with the computing nodes in its cluster domain, the heartbeat information comprises at least any one of a management node identifier of the management node and a random sequence number, and the first multicast domain communication mode comprises communication between the management node and the corresponding domain master nodes;

and the domain master node sends response information to the management node under the condition of receiving the heartbeat information, wherein the response information at least comprises any one of a domain master node identifier of the domain master node and the random sequence number.

In an exemplary embodiment, after the management node sends heartbeat information to the domain master node in the cluster domain according to a preset period, the method further includes:

in the case that a first number of the computing nodes do not receive first heartbeat information sent by a first domain master node in a continuous first period, or the management node does not receive first response information fed back by the first domain master node based on the heartbeat information in a continuous target period, the management node and/or the computing nodes initiate domain master node reselection processing to determine a second domain master node, wherein the second domain master node is any computing node included in the cluster domain, and the response information includes the first response information;

the second domain master node sends first heartbeat information to the computing nodes included in the cluster domain, or the second domain master node feeds back second response information based on the heartbeat information sent by the management node, wherein the response information comprises the second response information;

and the management node performs first updating processing on the domain master node list based on the second response information.

In an exemplary embodiment, after the management node initiates a domain master reselection process to determine a second domain master, the method further comprises:

the first domain master node feeds back third response information to the second domain master node based on the first heartbeat information;

and the second domain master node feeds back management information containing the state information of the first domain master node to the management node based on the third response information, and the management node performs second updating processing on the domain master node list based on the management information.

In an exemplary embodiment, the method further comprises:

the domain master node sends second heartbeat information to a first computing node contained in the cluster domain at regular time according to a preset period through a second multicast domain communication mode, wherein the second multicast domain communication mode comprises communication between the domain master node and the corresponding computing node;

the domain master node performs fault detection on the first computing node under the condition that the domain master node does not receive fourth response information fed back by the first computing node based on the second heartbeat information in a continuous second period so as to obtain first detection information;

switching the first computing node into a fault domain and sending first state information to the management node under the condition that the first detection information indicates that the first computing node is in a fault state;

and the management node performs third updating processing on the node list in the cluster domain based on the first state information.

In an exemplary embodiment, after said switching said first computing node into the failure domain and sending first state information to said management node, said method further comprises:

performing recovery detection on the first computing node, switching the first computing node to the cluster domain after detecting that the first computing node is recovered from faults, and sending second state information to the management node;

and the management node performs fourth updating processing on the node list in the cluster domain based on the second state information.

According to another embodiment of the present invention, there is provided a large-scale cloud server cluster management system including:

a cluster domain provided with a plurality of computing nodes;

the domain master node is in communication connection with a plurality of computing nodes in the cluster domain and is used for monitoring the running states of all the computing nodes in the cluster domain, and the domain master node corresponds to the cluster domain one to one;

and the management node is in communication connection with the domain master node and is used for monitoring the running state of the domain master node.

In an exemplary embodiment, the management node sends heartbeat information to domain master nodes in a cluster domain according to a preset period in a first multicast domain communication mode, wherein the domain master nodes are in one-to-one correspondence with the cluster domain, the domain master nodes are in communication connection with computing nodes in the cluster domain, the heartbeat information at least comprises any one of management node identifiers and random sequence numbers of the management nodes, and the first multicast domain communication mode comprises communication between the management nodes and the corresponding domain master nodes.

In an exemplary embodiment, after the management node sends heartbeat information to the domain master nodes in the cluster domain according to a preset period, if a first number of the computing nodes do not receive first heartbeat information sent by the first domain master node in a continuous first period, or the management node does not receive first response information fed back by the first domain master node based on the heartbeat information in a continuous target period, the management node and/or the computing nodes initiate domain master node reselection processing to determine a second domain master node, where the second domain master node is any computing node included in the cluster domain, and the response information includes the first response information.

According to a further embodiment of the invention, there is also provided a computer readable storage medium having stored therein a computer program, wherein the computer program is arranged to perform the steps of any of the method embodiments described above when run.

According to a further embodiment of the invention, there is also provided an electronic device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.

According to the invention, the cluster domain is introduced, the domain master node of the cluster domain is used for managing the computing nodes, and the management node is used for managing the domain master node, so that the decentralization of the large-scale cluster is realized, and the functions of the management node are weakened, therefore, the problem of cluster function failure caused by the abnormality of the management node can be solved, and the effect of improving the effective self-management of the cluster function is achieved.

Drawings

FIG. 1 is a hardware block diagram of a mobile terminal for running a large-scale cloud server cluster management method according to an embodiment of the present invention;

FIG. 2 is a flowchart of a large-scale cloud server cluster management method according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of large-scale cloud server cluster management according to an embodiment of the present invention;

FIG. 4 is a first schematic diagram of an embodiment of the present invention;

FIG. 5 is a second schematic diagram of an embodiment of the present invention;

FIG. 6 is a block diagram of a large-scale cloud server cluster management system according to an embodiment of the present invention.

Detailed Description

Embodiments of the present invention will be described in detail below with reference to the accompanying drawings in conjunction with the embodiments.

It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order.

The method embodiments provided in the embodiments of the present application may be performed in a mobile terminal, a computer terminal or similar computing device. Taking the operation on a mobile terminal as an example, fig. 1 is a hardware structure block diagram of a mobile terminal of a large-scale cloud server cluster management method according to an embodiment of the present invention. As shown in fig. 1, a mobile terminal may include one or more (only one is shown in fig. 1) processors 102 (the processor 102 may include, but is not limited to, a microprocessor MCU or a processing device such as a programmable logic device FPGA) and a memory 104 for storing data, wherein the mobile terminal may also include a transmission device 106 for communication functions and an input-output device 108. It will be appreciated by those skilled in the art that the structure shown in fig. 1 is merely illustrative and not limiting of the structure of the mobile terminal described above. For example, the mobile terminal may also include more or fewer components than shown in fig. 1, or have a different configuration than shown in fig. 1.

The memory 104 may be used to store a computer program, for example, a software program of application software and a module, such as a computer program corresponding to a method for managing a cluster of a large-scale cloud server in an embodiment of the present invention, and the processor 102 executes the computer program stored in the memory 104 to perform various functional applications and data processing, that is, implement the method described above. Memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory remotely located relative to the processor 102, which may be connected to the mobile terminal via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The transmission device 106 is used to receive or transmit data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the mobile terminal. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, simply referred to as NIC) that can connect to other network devices through a base station to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is configured to communicate with the internet wirelessly.

In this embodiment, a large-scale cloud server cluster management method is provided. Fig. 2 is a flowchart of the method according to an embodiment of the present invention. As shown in fig. 2, the flow includes the following steps:

step S201, a management node sends heartbeat information to domain master nodes in a cluster domain according to a preset period in a first multicast domain communication mode, wherein the domain master nodes are in one-to-one correspondence with the cluster domain, the domain master nodes are in communication connection with computing nodes in the cluster domain, the heartbeat information at least comprises any one of management node identifiers and random sequence numbers of the management nodes, and the first multicast domain communication mode comprises communication between the management nodes and the corresponding domain master nodes;

step S202, the domain master node sends response information to the management node when receiving the heartbeat information, where the response information at least includes any one of a domain master node identifier of the domain master node and the random sequence number.

In this embodiment, the management node periodically sends heartbeat information to the domain master node of each cluster domain, if the domain master node is normal, the corresponding information is fed back to the management node, otherwise, the corresponding information is not fed back, so that the management of the domain master node can be realized; similarly, the domain master node may also send heartbeat information to each computing node in the cluster domain, and determine whether the corresponding computing node is normal through feedback thereof.

Cluster domains are obtained by dividing all servers into a plurality of groups, each group of servers corresponding to one cluster domain. Each cluster domain is provided with a management device that manages the computing nodes in the domain; this management device is the domain master node. Introducing cluster domains decentralizes the management of a large-scale server cluster.

Specifically, as shown in fig. 3, the cluster architecture follows a divide-and-conquer approach and involves three roles: the management node, the domain master nodes and the computing nodes. Each cluster domain is responsible for managing all computing nodes within it. To avoid a single point of failure at the management node, the management node runs in active/standby mode and provides services externally through a unified virtual IP. The management node manages and monitors the running state of the domain master nodes; each domain master node manages and monitors the running states of all computing nodes in its domain and reports the running state of the domain and the monitoring information to the management node.

A cluster domain is composed of one domain master node and a plurality of computing nodes. The domain master node is responsible for managing and monitoring all computing nodes in the cluster domain. When the cluster is initialized, the scale of each cluster domain (10-30) is determined from the global configuration file. To save network bandwidth and reduce network load within the cluster, each cluster domain is also a multicast domain: when a cloud server joins a cluster domain, it joins the corresponding multicast domain. Also at cluster initialization, an unused multicast address is obtained from the multicast address pool (239.0.0.1-239.255.255.255) for each cluster domain, which fixes the multicast address of that domain.
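The initialization described above can be illustrated with a minimal Python sketch. It assumes the simplest possible policies, which the description does not prescribe: servers are grouped in list order, and each domain receives the lowest unused address in the pool. The function names `partition_into_domains` and `allocate_multicast_addresses` are hypothetical.

```python
import ipaddress
from typing import AbstractSet, List

# Pool bounds taken from the description above.
POOL_FIRST = int(ipaddress.IPv4Address("239.0.0.1"))
POOL_LAST = int(ipaddress.IPv4Address("239.255.255.255"))


def partition_into_domains(servers: List[str], domain_size: int) -> List[List[str]]:
    """Split servers into consecutive groups of at most domain_size members.

    Assumption: grouping by list order; the last group may be smaller.
    """
    if not 10 <= domain_size <= 30:
        raise ValueError("cluster domain scale must lie in the range 10-30")
    return [servers[i:i + domain_size] for i in range(0, len(servers), domain_size)]


def allocate_multicast_addresses(
    domain_count: int, in_use: AbstractSet[str] = frozenset()
) -> List[str]:
    """Return one unused multicast address per cluster domain.

    Assumption: lowest-address-first allocation from the pool.
    """
    allocated: List[str] = []
    candidate = POOL_FIRST
    while len(allocated) < domain_count:
        if candidate > POOL_LAST:
            raise RuntimeError("multicast address pool exhausted")
        address = str(ipaddress.IPv4Address(candidate))
        if address not in in_use:
            allocated.append(address)
        candidate += 1
    return allocated
```

This sketch does not rebalance a small trailing group; a real deployment would need a policy for servers that do not fill a whole domain.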

The management node and the domain master nodes of the cluster domains together form a further multicast domain. When a cluster domain is created and its domain master node is selected, that domain master node joins the multicast domain shared with the management node. This multicast domain serves as the primary (first-level) multicast domain and carries the communication between the management node and the domain master nodes (namely, the first multicast domain communication mode).
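As an illustration of a domain master node belonging to two multicast domains at once, the following Python sketch joins a UDP socket to two multicast groups using the standard `socket` module. The use of UDP, the port number and the helper names are assumptions; the description does not specify a transport.

```python
import socket


def membership_request(group: str, interface: str = "0.0.0.0") -> bytes:
    """Build the 8-byte ip_mreq value: group address followed by local interface."""
    return socket.inet_aton(group) + socket.inet_aton(interface)


def join_multicast_domain(sock: socket.socket, group: str) -> None:
    """Ask the kernel to deliver datagrams addressed to the given group."""
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, membership_request(group))


def open_domain_master_socket(port: int, primary_group: str, secondary_group: str) -> socket.socket:
    """Hypothetical helper: one socket joined to both multicast domains.

    primary_group   -> management node <-> domain master nodes
    secondary_group -> domain master node <-> computing nodes of its domain
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    join_multicast_domain(sock, primary_group)
    join_multicast_domain(sock, secondary_group)
    return sock
```

A computing node would join only the group of its own cluster domain, while the management node would join only the primary group.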

When the cluster is started, the heartbeat period for all nodes in the cluster must be set as a global parameter, within the range of 30 s to 10 min.

The management node in the cluster acts as the data source: it periodically sends heartbeat information through the primary multicast domain, carrying the unique identifier of the data source (corresponding to the management node identifier) together with a random sequence number in the heartbeat message. When a domain master node receives the heartbeat information, it verifies that the message comes from the correct source and then responds with an ack (corresponding to the response information) carrying the unique identifier of the domain master node and the received sequence number. Each node in the cluster has a unique identifier, such as a UUID, which serves as its identity within the cluster.
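The heartbeat and ack exchange can be sketched in Python as follows. The field names, the 32-bit sequence width and the source check (a plain comparison against a known management node identifier) are assumptions made for illustration; the description states only that the identifier and a random sequence number are carried and that the source is verified.

```python
import secrets
import uuid
from dataclasses import dataclass
from typing import AbstractSet, Optional


@dataclass(frozen=True)
class Heartbeat:
    manager_id: str  # unique identifier of the management node (data source)
    sequence: int    # random sequence number carried in the message


@dataclass(frozen=True)
class Ack:
    domain_master_id: str  # unique identifier of the responding domain master node
    sequence: int          # echo of the sequence number that was received


def make_heartbeat(manager_id: str) -> Heartbeat:
    """Management node side: create a heartbeat with a fresh random sequence number."""
    return Heartbeat(manager_id, secrets.randbits(32))


def handle_heartbeat(hb: Heartbeat, trusted_manager_id: str, own_id: str) -> Optional[Ack]:
    """Domain master side: ignore heartbeats from an unexpected source, otherwise ack."""
    if hb.manager_id != trusted_manager_id:
        return None
    return Ack(own_id, hb.sequence)


def ack_matches(hb: Heartbeat, ack: Ack, known_masters: AbstractSet[str]) -> bool:
    """Management node side: accept an ack only if it echoes the outstanding sequence."""
    return ack.sequence == hb.sequence and ack.domain_master_id in known_masters


# Example identifiers generated as UUIDs, as suggested in the description.
manager = str(uuid.uuid4())
master = str(uuid.uuid4())
```

Echoing the sequence number lets the management node tie each ack to a specific heartbeat, so a delayed ack from an earlier period is not counted for the current one.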

The method introduces the concept of a cluster domain, treating a subset of the hosts as a small cluster in its own right. Hosts within a cluster domain continue to be managed and monitored inside the domain even when the management node is abnormal, which weakens the role of the management node. The cluster can therefore support more cloud servers: adding cloud servers does not increase the burden on the management node, and the management node does not become a bottleneck in the cluster.

In addition, multicast is used for communication among the nodes in a cluster domain, which saves network bandwidth and reduces network load in the cluster.

In an alternative embodiment, after the management node sends heartbeat information to the domain master node in the cluster domain according to a preset period, the method further includes:

step S2011, in a case that a first number of the computing nodes do not receive first heartbeat information sent by a first domain master node in a continuous first period, or the management node does not receive first response information fed back by the first domain master node based on the heartbeat information in a continuous target period, the management node and/or the computing nodes initiate domain master node reselection processing to determine a second domain master node, where the second domain master node is any computing node included in the cluster domain, and the response information includes the first response information;

step S2012, the second domain master node sends first heartbeat information to the computing nodes included in the cluster domain, or the second domain master node feeds back second response information based on the heartbeat information sent by the management node, where the response information includes the second response information;

and step S2013, the management node performs first updating processing on the domain master node list based on the second response information.

In this embodiment, when the computing node or the management node does not receive the related information, the computing node or the management node automatically reselects any computing node as the domain master node, so as to ensure the stability of system communication.

Specifically, as shown in fig. 4, when the domain master node (corresponding to the first domain master node) fails, it can no longer act as the data source of the multicast domain: it cannot send heartbeat information periodically, nor can it respond to the heartbeat information sent by the management node. When more than half of the computing nodes in the cluster domain (corresponding to the first number) have received no heartbeat information for three consecutive periods (corresponding to the first period), the computing nodes or the management node initiate a new selection of the domain master node (corresponding to the domain master node reselection processing). Once the new domain master node (corresponding to the second domain master node) is determined, it joins the primary multicast domain and continues to respond to the heartbeat information of the management node, and the management node then updates the domain master node list.
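The reselection trigger (more than half of the computing nodes missing heartbeats for three consecutive periods) can be written as a small Python sketch. How the per-node miss counts are gathered, and how the new domain master node is chosen, are not specified in the description; the `select_new_master` policy below (lowest identifier wins) is purely a placeholder assumption.

```python
from typing import Dict, Iterable

MISSED_PERIODS_THRESHOLD = 3  # "three periods" in the description


def needs_reselection(missed_periods: Dict[str, int]) -> bool:
    """Decide whether domain master reselection should start.

    missed_periods maps each computing node id to the number of consecutive
    periods in which it received no heartbeat from the domain master node.
    """
    silent = sum(1 for count in missed_periods.values() if count >= MISSED_PERIODS_THRESHOLD)
    return silent > len(missed_periods) / 2  # strictly more than half


def select_new_master(candidates: Iterable[str]) -> str:
    """Placeholder policy (assumption): pick the candidate with the lowest id."""
    return min(candidates)
```

With four nodes, two silent nodes are exactly half and do not trigger reselection; three do.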

It should be noted that it is preferable for the computing nodes to initiate the reselection of the domain master node. The computing nodes communicate directly with the domain master node and have more candidate nodes to choose from, whereas there is no direct communication path between the management node and the computing nodes, so the management node cannot reliably know the real states of the computing nodes and therefore cannot select effectively.

In an alternative embodiment, after the management node initiates a domain master reselection process to determine a second domain master, the method further comprises:

step S2014, the first domain master node feeds back third response information to the second domain master node based on the first heartbeat information;

in step S2015, the second domain master node feeds back management information including the state information of the first domain master node to the management node based on the third response information, and the management node performs a second update process on the domain master node list based on the management information.

In this embodiment, after the failed domain master node (corresponding to the first domain master node) recovers, it can rejoin the cluster domain only as a computing node. It responds to the heartbeat information sent by the new domain master node (corresponding to the second domain master node); after receiving this response message, the new domain master node reports the state of the recovered node (the first domain master node) to the management node, and the management node updates the node list state in the cluster.
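The two list updates performed by the management node (one when a new domain master node starts responding, one when the former domain master node is reported back as an ordinary computing node) might be tracked as in the following Python sketch. The class names and the state labels `"master"`, `"unresponsive"` and `"compute"` are hypothetical; the description does not define a data model.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class DomainRecord:
    master_id: str
    node_states: Dict[str, str] = field(default_factory=dict)


class DomainMasterList:
    """Hypothetical view of the management node's domain master node list."""

    def __init__(self) -> None:
        self.domains: Dict[str, DomainRecord] = {}

    def register(self, domain_id: str, master_id: str) -> None:
        self.domains[domain_id] = DomainRecord(master_id, {master_id: "master"})

    def on_new_master_response(self, domain_id: str, new_master_id: str) -> None:
        """First update: a newly selected domain master answered the heartbeat."""
        record = self.domains[domain_id]
        if new_master_id == record.master_id:
            return  # same master as before, nothing to change
        record.node_states[record.master_id] = "unresponsive"
        record.master_id = new_master_id
        record.node_states[new_master_id] = "master"

    def on_former_master_report(self, domain_id: str, former_master_id: str) -> None:
        """Second update: the new master reports that the old one is back as a computing node."""
        self.domains[domain_id].node_states[former_master_id] = "compute"
```

Keeping the former master in the list with a changed state, rather than deleting it, reflects the statement that the recovered node rejoins the domain only as a computing node.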

In an alternative embodiment, the method further comprises:

step 203, the domain master node sends second heartbeat information to a first computing node included in the cluster domain according to a preset period through a second multicast domain communication mode, wherein the second multicast domain communication mode includes communication between the domain master node and the corresponding computing node;

step S203, where the domain master node does not receive the fourth response information fed back by the first computing node based on the second heartbeat information in the continuous second period, the domain master node performs fault detection on the first computing node to obtain first detection information;

step S204, switching the first computing node into a fault domain and sending first state information to the management node when the first detection information indicates that the first computing node is in a fault state;

in step S205, the management node performs a third update process on the node list in the cluster domain based on the first state information.

In this embodiment, when a computing node fails, the state of the computing node is periodically detected, and after the computing node recovers, the corresponding node state is updated, so as to facilitate subsequent processing of the computing node.

As shown in fig. 5, the domain master node likewise acts as a data source and periodically sends heartbeat information through the secondary multicast domain; the nodes in the cluster domain verify the source of the message and respond in time. The secondary multicast domain refers to the communication range between the domain master node and all computing nodes in the cluster domain (corresponding to the second multicast domain communication mode).

When computing node a fails, it cannot respond to the heartbeat messages of the domain master node. When the domain master node has received no heartbeat response from it for three consecutive periods (corresponding to the second period), the domain master node performs fault detection on the computing node. If computing node a is found to be in a fault state, it is placed in the fault domain and the domain master node reports this to the management node, which updates the state of the node in the corresponding node list of the cluster.
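The detection flow on the domain master node can be sketched in Python as below. The `probe` callback stands in for the unspecified fault detection step and `report` for the message to the management node; both, along with the class name and the `"fault"` label, are assumptions.

```python
from typing import Callable, Dict, Iterable, Set


class DomainMasterMonitor:
    """Hypothetical per-domain bookkeeping kept by a domain master node."""

    def __init__(
        self,
        nodes: Iterable[str],
        probe: Callable[[str], bool],        # returns True if the node is actually healthy
        report: Callable[[str, str], None],  # sends (node id, state) to the management node
        threshold: int = 3,                  # "three periods" in the description
    ) -> None:
        self.cluster_domain: Set[str] = set(nodes)
        self.fault_domain: Set[str] = set()
        self._missed: Dict[str, int] = {node: 0 for node in self.cluster_domain}
        self._probe = probe
        self._report = report
        self._threshold = threshold

    def end_of_period(self, responders: Set[str]) -> None:
        """Call once per heartbeat period with the ids that sent a response."""
        for node in sorted(self.cluster_domain):  # sorted() copies, so the set may change
            if node in responders:
                self._missed[node] = 0
                continue
            self._missed[node] += 1
            # Only after enough consecutive misses is the explicit fault check run.
            if self._missed[node] >= self._threshold and not self._probe(node):
                self.cluster_domain.discard(node)
                self.fault_domain.add(node)
                self._report(node, "fault")
```

Running the explicit probe before moving the node guards against treating lost response messages as a node failure.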

In an alternative embodiment, after said switching said first computing node into the failure domain and sending first state information to said management node, said method further comprises:

step S2041, performing recovery detection on the first computing node, switching the first computing node back to the cluster domain after detecting that the first computing node has recovered from the fault, and sending second state information to the management node;

step S2042, the management node performs a fourth update process on the node list in the cluster domain based on the second state information.

In this embodiment, when the computing node recovers from the fault, its relevant information is reported in time to facilitate subsequent processing of the computing node.

Specifically, after the computing node recovers, it rejoins the secondary multicast domain of the cluster domain to which it previously belonged and responds to the heartbeat information by sending a response message to the corresponding domain master node. After receiving the response message, the domain master node reports the state of the node (corresponding to the second state information) to the management node, and the management node updates the node list state in the cluster.
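The recovery path can be illustrated with a short Python sketch in which a heartbeat response from a node currently held in the fault domain is taken as the recovery signal. The function name, the set-based bookkeeping and the `"recovered"` label are assumptions for illustration only.

```python
from typing import Callable, Set


def handle_heartbeat_response(
    node_id: str,
    cluster_domain: Set[str],
    fault_domain: Set[str],
    report: Callable[[str, str], None],  # sends (node id, state) to the management node
) -> bool:
    """Domain master side: process a heartbeat response from node_id.

    Returns True if the response caused the node to be moved out of the
    fault domain and reported as recovered.
    """
    if node_id not in fault_domain:
        return False  # ordinary response from a healthy node, nothing to report
    fault_domain.remove(node_id)
    cluster_domain.add(node_id)
    report(node_id, "recovered")
    return True
```

For example, with `cluster_domain = {"b"}` and `fault_domain = {"a"}`, a response from `"a"` moves it back into the cluster domain and triggers exactly one report to the management node.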

From the description of the above embodiments, it will be clear to a person skilled in the art that the method according to the above embodiments may be implemented by means of software plus the necessary general hardware platform, but of course also by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method according to the embodiments of the present invention.

This embodiment also provides a large-scale cloud server cluster management system, which is used to implement the above embodiments and preferred implementations; what has already been described is not repeated here. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. While the means described in the following embodiments are preferably implemented in software, implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.

Fig. 6 is a block diagram of a large-scale cloud server cluster management system according to an embodiment of the present invention, as shown in fig. 6, the system includes:

a cluster domain 61 provided with a plurality of computing nodes;

a domain master node 62, communicatively connected to a plurality of computing nodes in the cluster domain, for monitoring operation states of all the computing nodes in the cluster domain, where the domain master node corresponds to the cluster domain one to one;

and the management node 63 is in communication connection with the domain master node and is used for monitoring the running state of the domain master node.

In an optional embodiment, the management node sends heartbeat information to domain master nodes in a cluster domain according to a preset period in a first multicast domain communication mode, wherein the domain master nodes are in one-to-one correspondence with the cluster domain, the domain master nodes are in communication connection with computing nodes in the cluster domain, the heartbeat information at least comprises any one of management node identifiers and random sequence numbers of the management nodes, and the first multicast domain communication mode comprises communication between the management nodes and the corresponding domain master nodes.

In an optional embodiment, after the management node sends heartbeat information to the domain master nodes in the cluster domain according to a preset period, if a first number of the computing nodes do not receive first heartbeat information sent by the first domain master node in a continuous first period, or the management node does not receive first response information fed back by the first domain master node based on the heartbeat information in a continuous target period, the management node and/or the computing nodes initiate domain master node reselection processing to determine a second domain master node, where the second domain master node is any computing node included in the cluster domain, and the response information includes the first response information.

It should be noted that each of the above modules may be implemented by software or hardware, and for the latter, it may be implemented by, but not limited to: the modules are all located in the same processor; alternatively, the above modules may be located in different processors in any combination.

Embodiments of the present invention also provide a computer readable storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the method embodiments described above when run.

In one exemplary embodiment, the computer readable storage medium may include, but is not limited to: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or various other media capable of storing a computer program.

An embodiment of the invention also provides an electronic device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.

In an exemplary embodiment, the electronic apparatus may further include a transmission device connected to the processor, and an input/output device connected to the processor.

For specific examples in this embodiment, reference may be made to the examples described in the foregoing embodiments and exemplary implementations, which are not repeated here.

It will be appreciated by those skilled in the art that the modules or steps of the invention described above may be implemented by a general purpose computing device. They may be concentrated on a single computing device or distributed across a network of computing devices. They may be implemented in program code executable by computing devices, so that they may be stored in a storage device and executed by the computing devices. In some cases, the steps shown or described may be performed in an order different from that shown or described herein. Alternatively, the modules or steps may be separately fabricated into individual integrated circuit modules, or multiple modules or steps among them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.

The above description covers only the preferred embodiments of the present invention and is not intended to limit the present invention; those skilled in the art may make various modifications and variations to the present invention. Any modification, equivalent replacement, improvement, etc. made within the principle of the present invention shall be included in the protection scope of the present invention.

Claims

1. A method for large-scale cloud server cluster management, comprising:

2. The method according to claim 1, wherein after the management node transmits heartbeat information to the domain master node in the cluster domain according to a preset period, the method further comprises:

3. The method of claim 2, wherein after the management node initiates a domain master reselection process to determine a second domain master, the method further comprises:

4. The method according to claim 1, wherein the method further comprises:

5. The method of claim 4, wherein after said switching said first computing node into a failure domain and sending first state information to said management node, said method further comprises:

6. A large-scale cloud server cluster management system, comprising:

a cluster domain provided with a plurality of computing nodes;

7. The system of claim 6, wherein the management node periodically sends heartbeat information, according to a preset period and through a first multicast domain communication mode, to the domain master nodes in the cluster domains, wherein the domain master nodes are in one-to-one correspondence with the cluster domains, each domain master node is in communication connection with the computing nodes in its cluster domain, the heartbeat information comprises at least one of a management node identifier and a random sequence number of the management node, and the first multicast domain communication mode comprises communication between the management node and the corresponding domain master nodes;

8. The system according to claim 7, wherein after the management node sends heartbeat information to the domain master nodes in the cluster domains according to the preset period, if a first number of the computing nodes do not receive first heartbeat information sent by a first domain master node over a continuous first period, or if the management node does not receive first response information fed back by the first domain master node based on the heartbeat information over a continuous target period, the management node and/or the computing nodes initiate domain master node reselection processing to determine a second domain master node, wherein the second domain master node is any computing node included in the cluster domain, and the response information includes the first response information;

9. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a computer program, wherein the computer program is arranged to execute the method of any of the claims 1 to 5 when run.

10. An electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, the processor being arranged to run the computer program to perform the method of any of the claims 1 to 5.