US20160036654A1 - Cluster system - Google Patents
- Publication number
- US20160036654A1
- Authority
- US
- United States
- Prior art keywords
- node
- node devices
- controller
- processor
- down state
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/50—Network service management, e.g. ensuring proper service fulfilment according to agreements
- H04L41/5003—Managing SLA; Interaction between SLA and QoS
- H04L41/5019—Ensuring fulfilment of SLA
- H04L41/5025—Ensuring fulfilment of SLA by proactively reacting to service quality change, e.g. by reconfiguration after service quality degradation or upgrade
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/16—Error detection or correction of the data by redundancy in hardware
- G06F11/20—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
- G06F11/202—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
- G06F11/2023—Failover techniques
- G06F11/2033—Failover techniques switching over of hardware resources
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/08—Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
- H04L43/0805—Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability
- H04L43/0817—Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability by checking functioning
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/08—Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
- H04L43/0876—Network utilisation, e.g. volume of load or congestion level
Definitions
- the present invention relates to a cluster system, specifically, a cluster system configured by a plurality of nodes and managing whether the nodes are alive.
- a cluster system configured by a plurality of nodes as shown in Patent Document 1 has a redundant configuration to, even when a node providing a service comes into a down state, take over the service to another node, thereby guaranteeing the quality of the service.
- it is also an issue for clusterware installed in such a cluster system how to quickly and accurately grasp the states (operation statuses, or whether a fault has occurred or not) of the nodes in order to realize a higher SLA.
- alive monitoring of nodes in a cluster system is performed in a manner that the nodes check each other's operation states by using, as a communication path, something that enables the nodes to exchange information, such as a LAN (Local Area Network), serial ports or a shared disk.
- Patent Document 1 Japanese Unexamined Patent Application Publication No. JP-A 2006-79161
- a LAN, serial ports, a shared disk and so on are all controlled as management resources of an OS (Operating System), and therefore, are affected not only by a physical fault of a communication path but also by the operation state of the OS.
- in a case that the OS of a specific node comes into a high-load condition, the specific node is considered to be in the down state by the other nodes though the node is not actually down.
- in a case that a node goes down due to a hardware fault or the like, it takes a specific time or more before the node is judged to be in the down state, and therefore, it is impossible to instantly execute system switching.
- an object of the present invention is to solve the abovementioned problem, “the reliability of the system decreases.”
- a cluster system of an exemplary embodiment of the present invention is a cluster system including a plurality of node devices.
- Each of the node devices is connected with the other node devices by a first network and a second network, and includes:
- a first node managing unit configured to operate on an operating system embedded in an own device and detect operation statuses of the other node devices via the first network;
- a second node managing unit configured to operate without being affected by the operating system and detect operation statuses of the other node devices via the second network; and
- a node status judging unit configured to judge whether each of the node devices is in a down state according to a preset standard, based on results of the detection of the other node devices by the first node managing unit and the second node managing unit.
- a program of another exemplary embodiment of the present invention is a program for causing each of a plurality of node devices configuring a cluster system including the plurality of node devices, to realize:
- a first node managing unit configured to operate on an operating system embedded in an own device and detect operation statuses of the other node devices via a first network connected to the other node devices;
- a second node managing unit configured to operate without being affected by the operating system and detect operation statuses of the other node devices via a second network connected to the other node device;
- a node status judging unit configured to judge whether each of the node devices is in a down state according to a preset standard, based on results of the detection of the other node devices by the first node managing unit and the second node managing unit.
- a node management method of another exemplary embodiment of the present invention includes, in a cluster system including a plurality of node devices:
- a first node managing unit configured to operate on an operating system embedded in each of the node devices, detecting operation statuses of the other node devices via a first network connected with the other node devices;
- a second node managing unit configured to operate without being affected by the operating system embedded in the node device, detecting operation statuses of the other node devices via a second network connected with the other node device;
- judging whether each of the node devices is in a down state according to a preset standard, based on results of the detection of the other node devices by the first node managing unit and the second node managing unit.
- the present invention can increase the reliability of a cluster system.
- FIG. 1 is a block diagram showing the configuration of a cluster system in a first exemplary embodiment of the present invention
- FIG. 2 is a block diagram showing the configuration of a node configuring the cluster system disclosed in FIG. 1 ;
- FIG. 3 is an explanation diagram for explaining the operation of the cluster system disclosed in FIG. 1 ;
- FIG. 4 is a flowchart showing the operation of a cluster controlling unit of the node disclosed in FIG. 2 ;
- FIG. 5 is a flowchart showing the operation of a node managing unit of the node disclosed in FIG. 2 ;
- FIG. 6 is a flowchart showing the operation of an operation status transmitting unit of the node disclosed in FIG. 2 ;
- FIG. 7 is a flowchart showing the operation of an operation status receiving unit of the node disclosed in FIG. 2 ;
- FIG. 8 is a flowchart showing the operation of a BMC node managing unit of the node disclosed in FIG. 2 ;
- FIG. 9 is a flowchart showing the operation of a BMC operation status acquiring unit of the node disclosed in FIG. 2 ;
- FIG. 10 is a flowchart showing the operation of a BMC controlling unit of the node disclosed in FIG. 2 ;
- FIG. 11 is a flowchart showing the operation of a hardware monitoring unit of the node disclosed in FIG. 2 ;
- FIG. 12 is a block diagram showing the configuration of a cluster system in a second exemplary embodiment of the present invention
- FIG. 13 is a block diagram showing the configuration of a virtual infrastructure configuring the cluster system disclosed in FIG. 12 ;
- FIG. 14 is a block diagram showing the configuration of a cluster system in Supplementary Note 1 of the present invention.
- a cluster system (also referred to as a “cluster” hereinafter) according to the present invention includes a plurality of node devices (also referred to as “nodes” hereinafter).
- the respective nodes perform alive monitoring of each other.
- the cluster system has a function to, in a case that one node comes into a down state, execute a system switching process of causing another node to restart a service having been executed by the one node. Below, the cluster system according to the present invention will be described.
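The takeover behavior described above can be sketched as follows; this is a minimal model with hypothetical names such as `fail_over`, since the patent does not specify an implementation:

```python
class Node:
    """Minimal model of a cluster node that can run a service."""
    def __init__(self, name, role):
        self.name = name
        self.role = role          # "active" or "standby"
        self.service_running = (role == "active")
        self.down = False

def fail_over(nodes):
    """If the active node is down, promote the first healthy standby node
    so that it restarts the service the active node was providing."""
    active = next((n for n in nodes if n.role == "active"), None)
    if active is None or not active.down:
        return active             # nothing to switch
    active.service_running = False
    for candidate in nodes:
        if candidate.role == "standby" and not candidate.down:
            candidate.role = "active"
            candidate.service_running = True
            return candidate
    return None                   # no healthy standby remains

nodes = [Node("node1", "active"), Node("node2", "standby"), Node("node3", "standby")]
nodes[0].down = True
new_active = fail_over(nodes)
# node2 takes over the service that node1 had been executing
```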
- FIGS. 1 and 2 are diagrams for describing the configuration of the cluster system.
- FIGS. 3 to 11 are views for describing the operation of the cluster system.
- the cluster system according to the present invention includes a plurality of node devices as shown with a node (1) 101, a node (2) 102, and a node (N) 103.
- the node devices 101 . . . are each configured by an information processing device like a server computer.
- the node devices 101 . . . may be each configured by an information processing device virtually structured as explained in a second exemplary embodiment described later.
- the number of the node devices 101 . . . configuring the cluster system according to the present invention is not limited to the number thereof in FIG. 1 .
- the node devices 101 . . . described above, in each of which an operating system (also referred to as "OS" hereinafter) is embedded, have service units 106 . . . for performing a predetermined service process provided to users and clusterwares 107 . . . for controlling the operation of the cluster system, respectively.
- the service units 106 . . . and the clusterwares 107 . . . are structured by embedding programs into arithmetic devices installed in the node devices 101 . . . , respectively.
- the node devices 101 . . . will be shown and described with reference numeral 201 in FIG. 2 .
- the service unit 106 included in the node device 101 which is an active system among the node devices 101 . . . configuring the cluster system, operates and provides a service process to a user.
- the service units 109 and 112 included in the other node devices 102 and 103 which are standby systems, are on standby (refer to dotted lines in FIG. 1 ).
- the clusterware 107 controls a process to start or stop the service units 106 . . . .
- the clusterware 107 executes system switching, which is switching between the active system and the standby system regarding the node devices, and the service unit included in another one of the node devices restarts the service.
- the clusterware 107 which is denoted by reference numeral 203 in FIG. 2 , includes a cluster controlling unit 205 , a node managing unit 206 , an operation status transmitting unit 207 and an operation status receiving unit 208 as shown in FIG. 2 .
- the node managing unit 206 manages a node list A 209 for holding “identifiers,” “addresses” and “operation statuses” of all of the nodes contained in the cluster system.
- the respective units 205 to 208 and the node list A 209 will be described in detail in explanation of the operation later.
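The node list A described here can be modeled as a small table keyed by node identifier; the field names below (`identifier`, `address`, `status`) are illustrative assumptions, not from the patent:

```python
from dataclasses import dataclass

@dataclass
class NodeEntry:
    identifier: str   # unique node identifier
    address: str      # address reachable over the first network (LAN)
    status: str       # "operating" or "down"

# node list A: one entry per node contained in the cluster system
node_list_a = {
    "node1": NodeEntry("node1", "192.0.2.1", "operating"),
    "node2": NodeEntry("node2", "192.0.2.2", "operating"),
}

def update_status(node_list, identifier, status):
    """Update the held operation status of one node (cf. step S28)."""
    node_list[identifier].status = status

update_status(node_list_a, "node2", "down")
```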
- the respective node devices 101 . . . are connected to a wired or wireless LAN (Local Area Network) (a first network).
- the respective node devices 101 . . . are enabled to perform communication with each other via the LAN and a network switch ( 1 ) 104 by the clusterware 107 operating on the operating system.
- the node devices 101 . . . include baseboard management controllers 108 . . . , respectively.
- Each of the baseboard management controllers 108 . . . operates as firmware implemented on hardware such as a processor installed in each of the node devices 101 . . . , and operates independently of the aforementioned OS embedded in each of the node devices 101 . . . . Therefore, even when any of the node devices 101 . . . comes to a standstill, the baseboard management controller 204 installed in each of the node devices 101 . . . can keep operating.
- the baseboard management controllers 108 . . . installed in the node devices 101 . . . , respectively, are connected to a wired or wireless management LAN (a second network), and are capable of performing communication with each other via the management LAN and a network switch ( 2 ) 105 . Because the network switch ( 1 ) 104 and the network switch ( 2 ) 105 are connected by a predetermined network, the clusterwares 107 . . . and the baseboard management controllers 108 . . . can also perform communication with each other.
- the baseboard management controllers 108 . . . are denoted by reference numeral 204 .
- the baseboard management controller 204 includes a BMC node managing unit 210 , a BMC operation status acquiring unit 211 , a BMC controlling unit 212 and a hardware monitoring unit 213 as shown in FIG. 2 .
- the BMC node managing unit 210 manages a node list B 214 for holding “identifiers” and “addresses” of all of the nodes contained in the cluster system.
- the respective units 210 to 213 and the node list B 214 will be described in detail in explanation of an operation later.
- the cluster controlling unit 205 requests the node managing unit 206 to start alive monitoring of the node devices, that is, start detection of operation statuses representing whether the own node and the other nodes are normally operating or are down (are not normally operating) (step S 1 in FIG. 4 ).
- the cluster controlling unit 205 waits for notification of the operation statuses from the respective node devices (step S2 in FIG. 4).
- Upon reception of the request for alive monitoring of the operation statuses from the cluster controlling unit 205 as described above, the node managing unit 206 (a first node managing unit) requests the operation status transmitting unit 207 to notify the operation status "operating" of the own node (step S21 in FIG. 5). Then, based on the addresses of all of the nodes acquired from the node list A 209 via the node managing unit 206 (step S31 in FIG. 6), the operation status transmitting unit 207 notifies the operation status "operating" of the own node to all of the nodes (steps S32 and S33 in FIG. 6).
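The transmit side of this alive monitoring can be sketched as below; `notify_operating` and `send` are hypothetical stand-ins for the operation status transmitting unit and the actual LAN transmission:

```python
def notify_operating(own_id, peers, send):
    """Notify the own node's 'operating' status to every address held in
    node list A; send() abstracts the transmission over the first network."""
    for entry in peers:
        send(entry["address"], {"node": own_id, "status": "operating"})

# collect outgoing notifications instead of really sending them
sent = []
peers = [{"identifier": "node1", "address": "192.0.2.1"},
         {"identifier": "node2", "address": "192.0.2.2"}]
notify_operating("node1", peers, lambda addr, msg: sent.append((addr, msg)))
# one notification per node held in the list
```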
- the notification of the operation status by the operation status transmitting unit 207 is received by the operation status receiving unit 208 of each of the nodes, and the operation status receiving unit 208 notifies the received operation status of each of the nodes to the node managing unit 206 every time it receives a notification (steps S41 and S42 in FIG. 7).
- the node managing unit 206 receives the operation status of each of the nodes from the operation status receiving unit 208 (step S23 in FIG. 5), and holds it as the result of detection of the operation status of each of the nodes.
- the node managing unit 206 judges a node device that has not sent a notification to the operation status receiving unit 208 for a given time or more to be in the down state, and holds this as the result of detection of the operation status of that node device. Because this detection of the operation statuses of all of the nodes by the node managing unit 206 is executed on the OS, when the OS of the own node device or of another node device is in a high-load condition, that node device cannot communicate with the other nodes for a given time or more and is judged to be in the down state though it is not actually down, as mentioned above.
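The timeout-based judgment on the OS path can be illustrated with a sketch like the following; the timeout value and class name are assumptions, not from the patent:

```python
HEARTBEAT_TIMEOUT = 10.0  # assumed "given time" without notification

class OsPathMonitor:
    """First node managing unit: suspects a peer to be down when its
    'operating' notifications stop arriving for a given time or more."""
    def __init__(self, timeout=HEARTBEAT_TIMEOUT):
        self.timeout = timeout
        self.last_seen = {}   # node identifier -> time of last notification

    def on_notification(self, node, now):
        self.last_seen[node] = now

    def detect(self, now):
        return {node: ("down" if now - seen >= self.timeout else "operating")
                for node, seen in self.last_seen.items()}

mon = OsPathMonitor()
mon.on_notification("node1", now=0.0)
mon.on_notification("node2", now=0.0)
mon.on_notification("node1", now=8.0)  # node1 keeps notifying; node2 goes silent
statuses = mon.detect(now=12.0)
# node2 has been silent for 12 s >= timeout, so the OS path suspects it down
```

Note that, exactly as the text warns, this path alone can misjudge an overloaded (but alive) node as down.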
- the node managing unit 206 requests the BMC node managing unit 210 to acquire the operation statuses of all of the node devices almost in tandem with the process of detection of the operation statuses of all of the nodes executed on the OS described above (step S 22 in FIG. 5 ).
- the BMC node managing unit 210 (a second node managing unit) requests the BMC operation status acquiring unit 211 to acquire the operation statuses of all of the nodes (step S 51 in FIG. 8 ).
- the BMC operation status acquiring unit 211 acquires the operation status from the corresponding BMC operation status acquiring unit 211 of each of the nodes (steps S62 and S63 in FIG. 9).
- the acquired operation status result is notified to the BMC node managing unit 210 (step S 64 in FIG. 9 , step S 52 in FIG. 8 ), and notified from the BMC node managing unit 210 to the node managing unit 206 (step S 53 in FIG. 8 ).
- the node managing unit 206 accepts and holds the result of detection of the operation statuses of all of the nodes without being affected by the OS, via the baseboard management controller 204 (step S 24 in FIG. 5 ).
- Because the request to the BMC node managing unit 210 by the node managing unit 206 is made periodically, the BMC node managing unit 210 detects that the own node is in the down state in a case that there is no request for a given time or more.
- the node managing unit 206 then judges the operation status of each node device (step S25 in FIG. 5). To be specific, the node managing unit 206 judges a node device to be actually in the down state, that is, not normally operating, only when both of the detection results indicate that the node device is in the down state.
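The judging rule of step S25, which treats a node as actually down only when both detection paths agree, can be written as a one-line predicate (hypothetical function name):

```python
def judge_down(os_path_result, bmc_path_result):
    """Node status judging: a node is treated as actually down only when
    BOTH the OS-level detection and the BMC-level detection report 'down'.
    A 'down' on the OS path alone may merely mean the peer's OS is overloaded."""
    return os_path_result == "down" and bmc_path_result == "down"

# OS path times out because the peer's OS is busy, but the BMC still responds:
busy_peer = judge_down("down", "operating")   # False: no system switching
# both paths agree, so the node really is down and switching may start:
dead_peer = judge_down("down", "down")        # True
```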
- the node managing unit 206 updates the operation status of a corresponding node in the node list A 209 (step S 28 in FIG. 5 ).
- the node managing unit 206 requests the cluster controlling unit 205 to execute a system switching process (step S 29 in FIG. 5 ).
- the cluster controlling unit 205 executes the system switching process, and the node managing unit 206 then stands by for a given time (step S27 in FIG. 5).
- the hardware monitoring unit 213 monitors the hardware of the own node (step S 91 in FIG. 11 ) and, when detecting a fault, notifies to the BMC controlling unit 212 (“Yes” at step S 92 and step S 94 in FIG. 11 , step S 71 in FIG. 10 ).
- the BMC controlling unit 212 judges whether there is a need to stop the node depending on the severity of the fault, and takes the following measures.
- the BMC controlling unit 212 forcibly stops the node (step S 81 in FIG. 10 ), and notifies stoppage of the node to the cluster controlling unit 205 of the other node (step S 82 in FIG. 10 , step S 93 in FIG. 11 ).
- the cluster controlling unit 205 of the node 201 having received the notification executes system switching.
- the BMC controlling unit 212 gives an advance notice of stoppage to the cluster controlling unit 205 of the own node 201 (step S 75 in FIG. 10 , arrow Y 1 in FIG. 3 ).
- the cluster controlling unit 205 requests the BMC controlling unit 212 to stop the node (step S11 in FIG. 4, step S76 and "Yes" at step S77 in FIG. 10), and the BMC controlling unit 212 stops the node 201 (step S81 in FIG. 10).
- the cluster controlling unit 205 requests the BMC controlling unit 212 to wait for completion of the system switching, in order to inhibit the stoppage process by the BMC controlling unit 212 (“No” at step S 77 and step S 78 in FIG. 10 , step S 7 in FIG. 4 , arrow Y 2 shown in FIG. 3 ).
- the cluster controlling unit 205 executes the system switching (step S8 in FIG. 4). For example, in the example shown in FIG. 3, the cluster controlling unit 205 stops the service unit 106 operating in the node (1) denoted by reference numeral 101, and executes the system switching so that the service unit 109 can start operating in the node (2) denoted by reference numeral 102 (arrows Y3 and Y4 in FIG. 3).
- After completion of the system switching, in order to cancel the inhibition of the stoppage process by the BMC controlling unit 212, the cluster controlling unit 205 notifies the completion of the system switching to the BMC controlling unit 212 (step S9 in FIG. 4, arrow Y5 in FIG. 3).
- the BMC controlling unit 212 having received the notification stops the node 201 ("No" at step S79 and step S80 in FIG. 10, arrow Y6 in FIG. 3). However, in a case that the system switching is not completed within a predetermined time ("Yes" at step S79 in FIG. 10), the BMC controlling unit 212 forcibly stops the node 201 (step S81 in FIG. 10).
- the cluster controlling unit 205 of the other node 201 having received the notification executes system switching (step S10, "from other node" at step S3, step S4, and step S5 in FIG. 4).
- the BMC controlling unit 212 takes measures for restoration of hardware in which a fault is caused (step S 73 in FIG. 10 ).
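The stop-inhibition handshake described in this flow can be sketched as a small state machine; the deadline value and method names are illustrative assumptions:

```python
SWITCHOVER_DEADLINE = 30.0  # assumed "predetermined time" before a forced stop

class BmcController:
    """Sketch of the handshake: the node is not stopped until the cluster
    software reports that system switching finished, or a deadline expires."""
    def __init__(self):
        self.stopped = False
        self.forced = False
        self.wait_started_at = None

    def request_wait(self, now):
        # cluster controlling unit inhibits the stop until switching completes
        self.wait_started_at = now

    def switching_completed(self):
        self.stopped = True       # normal stop after the switch-over (step S80)

    def tick(self, now):
        # forcibly stop if the switch-over did not complete in time (step S81)
        if (not self.stopped and self.wait_started_at is not None
                and now - self.wait_started_at >= SWITCHOVER_DEADLINE):
            self.stopped = True
            self.forced = True

bmc = BmcController()
bmc.request_wait(now=0.0)
bmc.tick(now=10.0)         # still within the deadline; node keeps running
assert not bmc.stopped
bmc.switching_completed()  # switching done; the node may now stop normally
```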
- the cluster system detects a hardware fault through hardware monitoring by the baseboard management controller 108, which is not affected by the operation status of the OS, and immediately notifies all of the nodes; hence, it can immediately execute system switching in a case that a node goes down due to a hardware fault. As a result, it is possible to increase the reliability of the cluster system.
- A second exemplary embodiment of the present invention will be described with reference to FIGS. 12 and 13. As shown in FIGS. 12 and 13, it is possible to realize the cluster system according to the present invention in a virtual environment.
- a plurality of nodes 1105 . . . operate within a virtual infrastructure (1) 1101, but it is enough to install only one baseboard management controller 1108.
- Each of K nodes within a virtual infrastructure 1201 shown in FIG. 13 acquires the operation statuses of the other nodes via the same baseboard management controller 1205 without being affected by the OS.
- a node list A 1212 managed by a node managing unit 1209 has the same configuration as the node list described in the first exemplary embodiment, whereas a node list B 1217 managed by a BMC node managing unit 1213 holds the "addresses" of virtual infrastructures and the "operation statuses of nodes within each of the virtual infrastructures." Thus, it is possible to acquire the operation statuses of a plurality of nodes in bulk from one virtual infrastructure.
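The bulk acquisition enabled by node list B can be sketched as follows; the field names and addresses are illustrative assumptions:

```python
# node list B in the virtual environment: one entry per virtual
# infrastructure, holding the statuses of all nodes hosted on it
node_list_b = {
    "infra1": {"address": "198.51.100.1",
               "node_statuses": {"node1": "operating", "node2": "operating"}},
    "infra2": {"address": "198.51.100.2",
               "node_statuses": {"node3": "down"}},
}

def acquire_bulk(node_list):
    """One query per virtual infrastructure returns the statuses of every
    node it hosts, instead of one query per node."""
    statuses = {}
    for infra in node_list.values():
        statuses.update(infra["node_statuses"])
    return statuses

all_statuses = acquire_bulk(node_list_b)
# statuses of node1..node3 obtained with only two infrastructure queries
```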
- a cluster system comprising a plurality of node devices, wherein each of the node devices 1 is connected with the other node devices by a first network 5 and a second network 6, and includes:
- a first node managing unit 2 configured to operate on an operating system embedded in an own device and detect operation statuses of the other node devices via the first network 5;
- a second node managing unit 3 configured to operate without being affected by the operating system and detect operation statuses of the other node devices via the second network 6 ;
- a node status judging unit 4 configured to judge whether each of the node devices is in a down state according to a preset standard, based on results of the detection of the other node devices by the first node managing unit 2 and the second node managing unit 3 .
- the node status judging unit is configured to, in a case that both the first node managing unit and the second node managing unit detect that any of the node devices is in the down state according to the preset standard, judge the node device to be in the down state.
- the cluster system comprising a cluster controlling unit configured to, in a case that the node device judged to be in the down state by the node status judging unit is executing a preset process, execute a node switching process of switching so that another of the node devices executes the preset process.
- the second node managing unit is configured to notify, to the cluster controlling unit, that the operation of the own device is due to be stopped based on the result of the monitoring;
- the cluster controlling unit is configured to receive notification that the operation of the own device is due to be stopped from the second node managing unit and, in a case that the own device is executing a preset process, execute the node switching process of switching so that another of the other node devices executes the process, and notify completion of the node switching process to the second node managing unit after the completion of the node switching process;
- the second node managing unit is configured to stop the operation of the own device after receiving notification that the node switching process by the cluster controlling unit is completed.
- a first node managing unit configured to operate on an operating system embedded in an own device and detect operation statuses of the other node devices via a first network connected to the other node devices;
- a second node managing unit configured to operate without being affected by the operating system and detect operation statuses of the other node devices via a second network connected to the other node device;
- a node status judging unit configured to judge whether each of the node devices is in a down state according to a preset standard, based on results of the detection of the other node devices by the first node managing unit and the second node managing unit.
- node status judging unit is configured to, in a case that both the first node managing unit and the second node managing unit detect that any of the node devices is in the down state according to the preset standard, judge the node device to be in the down state.
- a node management method comprising, in a cluster system including a plurality of node devices:
- a first node managing unit configured to operate on an operating system embedded in each of the node devices, detecting operation statuses of the other node devices via a first network connected with the other node devices;
- a second node managing unit configured to operate without being affected by the operating system embedded in the node device, detecting operation statuses of the other node devices via a second network connected with the other node device;
- judging whether each of the node devices is in a down state according to a preset standard, based on results of the detection of the other node devices by the first node managing unit and the second node managing unit.
- the program disclosed above is stored in a storage device, or recorded on a non-transitory computer-readable recording medium.
- the non-transitory computer-readable recording medium is a portable medium such as a flexible disk, an optical disk, a magneto-optical disk and a semiconductor memory.
Abstract
A cluster system of the present invention is a cluster system including a plurality of node devices. Each of the node devices is connected with the other node devices by a first network and a second network, and includes: a first node managing unit configured to operate on an operating system embedded in an own device and detect operation statuses of the other node devices via the first network; a second node managing unit configured to operate without being affected by the operating system and detect operation statuses of the other node devices via the second network; and a node status judging unit configured to judge whether each of the node devices is in a down state according to a preset standard, based on results of the detection of the other node devices by the first node managing unit and the second node managing unit.
Description
- The present application is a continuation application of U.S. patent application Ser. No. 13/748,189 filed on Jan. 23, 2013, which claims the benefit of priority from Japanese Patent Application 2012-052640 filed on Mar. 9, 2012, the disclosures of all of which are incorporated in their entirety by reference herein.
- In recent years, the advent of cloud computing raises the issue of how to guarantee the quality of a service provided to a user by a provider, namely, how to keep the SLA (Service Level Agreement). Therefore, a cluster system configured by a plurality of nodes as shown in Patent Document 1 has a redundant configuration to, even when a node providing a service comes into a down state, take over the service to another node, thereby guaranteeing the quality of the service. On the other hand, it is also an issue for clusterware installed in such a cluster system how to quickly and accurately grasp the states (operation statuses, or whether a fault has occurred or not) of the nodes in order to realize a higher SLA.
- Alive monitoring of nodes in a cluster system is performed in a manner that the nodes check each other's operation states by using, as a communication path, something that enables the nodes to exchange information, such as a LAN (Local Area Network), serial ports or a shared disk. In a case that it is impossible to perform communication with a certain node for a given time or more, the certain node is judged to be in the down state.
- However, in the abovementioned method, a LAN, serial ports, a shared disk and so on are all controlled as management resources of an OS (Operating System), and therefore, are affected not only by a physical fault of a communication path but also by the operation state of the OS. For example, in a case that the OS of a specific node comes into a high-load condition and cannot perform communication with the other nodes for a given time or more, the specific node is considered to be in the down state by the other nodes though the node is not actually down.
- Further, in a case that a node goes down due to a hardware fault or the like, it takes a specific time or more before the node is judged to be in the down state, and therefore, it is impossible to instantly execute system switching. For example, when power interruption is caused by a CPU (Central Processing Unit) fault and one node comes into a down state, it takes a specific time or more before another node judges the one node to be in the down state.
- Thus, a cluster system has the problem that it is impossible to grasp the statuses of the nodes accurately and rapidly; consequently, it is impossible to switch the nodes rapidly, and the reliability of the system decreases.
- Accordingly, an object of the present invention is to solve the abovementioned problem that the reliability of the system decreases.
- A cluster system of an exemplary embodiment of the present invention is a cluster system including a plurality of node devices.
- Each of the node devices is connected with the other node devices by a first network and a second network, and includes:
- a first node managing unit configured to operate on an operating system embedded in an own device and detect operation statuses of the other node devices via the first network;
- a second node managing unit configured to operate without being affected by the operating system and detect operation statuses of the other node devices via the second network; and
- a node status judging unit configured to judge whether each of the node devices is in a down state according to a preset standard, based on results of the detection of the other node devices by the first node managing unit and the second node managing unit.
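As a concrete illustration of the judging rule above, the preset standard can be an AND of the two detection paths: a node device is judged to be in the down state only when both the first and the second node managing units report it down. A minimal Python sketch (the function name and status strings are illustrative assumptions, not from this document):

```python
def judge_down(os_side, bmc_side):
    """Return the set of node ids that BOTH detection paths consider down.

    os_side / bmc_side map node id -> "operating" or "down", as reported
    by the first and the second node managing units respectively.
    """
    return {n for n, s in os_side.items()
            if s == "down" and bmc_side.get(n) == "down"}
```

A node that only the OS-side path reports down (for example, because its OS is overloaded) is therefore not judged to be in the down state.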
- Further, a program of another exemplary embodiment of the present invention is a program for causing each of a plurality of node devices configuring a cluster system including the plurality of node devices, to realize:
- a first node managing unit configured to operate on an operating system embedded in an own device and detect operation statuses of the other node devices via a first network connected to the other node devices;
- a second node managing unit configured to operate without being affected by the operating system and detect operation statuses of the other node devices via a second network connected to the other node device; and
- a node status judging unit configured to judge whether each of the node devices is in a down state according to a preset standard, based on results of the detection of the other node devices by the first node managing unit and the second node managing unit.
- Further, a node management method of another exemplary embodiment of the present invention includes, in a cluster system including a plurality of node devices:
- by a first node managing unit configured to operate on an operating system embedded in each of the node devices, detecting operation statuses of the other node devices via a first network connected with the other node devices;
- by a second node managing unit configured to operate without being affected by the operating system embedded in the node device, detecting operation statuses of the other node devices via a second network connected with the other node device; and
- judging whether each of the node devices is in a down state according to a preset standard, based on results of the detection of the other node devices by the first node managing unit and the second node managing unit.
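The second detecting step above differs from the first in that it actively queries each node's management controller over the second network instead of waiting for heartbeats on the first. The sketch below models that poll; the `probe` callables standing in for management-network queries, and the treatment of an I/O error as "down", are illustrative assumptions:

```python
class BmcSideMonitor:
    """Second node managing unit: asks each node's baseboard management
    controller for its status over the management network, independently
    of the operating systems of the monitored nodes."""

    def __init__(self, probes):
        # probes maps node id -> callable returning True when the controller
        # answers that the node is up; an exception means no answer at all.
        self._probes = probes

    def detect(self):
        results = {}
        for node_id, probe in self._probes.items():
            try:
                results[node_id] = "operating" if probe() else "down"
            except OSError:
                # No response on the management network: treat as down.
                results[node_id] = "down"
        return results
```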
- With the configurations as described above, the present invention can increase the reliability of a cluster system.
FIG. 1 is a block diagram showing the configuration of a cluster system in a first exemplary embodiment of the present invention; -
FIG. 2 is a block diagram showing the configuration of a node configuring the cluster system disclosed in FIG. 1; -
FIG. 3 is an explanation diagram for explaining the operation of the cluster system disclosed in FIG. 1; -
FIG. 4 is a flowchart showing the operation of a cluster controlling unit of the node disclosed in FIG. 2; -
FIG. 5 is a flowchart showing the operation of a node managing unit of the node disclosed in FIG. 2; -
FIG. 6 is a flowchart showing the operation of an operation status transmitting unit of the node disclosed in FIG. 2; -
FIG. 7 is a flowchart showing the operation of an operation status receiving unit of the node disclosed in FIG. 2; -
FIG. 8 is a flowchart showing the operation of a BMC node managing unit of the node disclosed in FIG. 2; -
FIG. 9 is a flowchart showing the operation of a BMC operation status acquiring unit of the node disclosed in FIG. 2; -
FIG. 10 is a flowchart showing the operation of a BMC controlling unit of the node disclosed in FIG. 2; -
FIG. 11 is a flowchart showing the operation of a hardware monitoring unit of the node disclosed in FIG. 2; -
FIG. 12 is a block diagram showing the configuration of a cluster system in a second exemplary embodiment of the present invention; -
FIG. 13 is a block diagram showing the configuration of a virtual infrastructure configuring the cluster system disclosed in FIG. 12; and -
FIG. 14 is a block diagram showing the configuration of a cluster system in Supplementary Note 1 of the present invention. - A cluster system (also referred to as a "cluster" hereinafter) according to the present invention includes a plurality of node devices (also referred to as "nodes" hereinafter). The respective nodes execute alive monitoring of each other. The cluster system has a function to, in a case that one node comes into a down state, execute a system switching process of causing another node to restart a service having been executed by the one node. Below, the cluster system according to the present invention will be described.
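The system switching process summarized above reduces to one decision: when the down node was the active system, promote an operating standby node and restart the service there. A hypothetical sketch (`choose_successor` and the status strings are illustrative names, not taken from this document):

```python
def choose_successor(statuses, active):
    """Pick the node that takes over the service when the active node is down.

    statuses maps node id -> "operating" or "down"; 'active' is the node
    currently running the service. Returns the node that should run the
    service after the check.
    """
    if statuses.get(active) != "down":
        return active  # the active system is healthy; nothing to do
    for node, status in statuses.items():
        if node != active and status == "operating":
            return node  # the first healthy standby restarts the service
    raise RuntimeError("no operating standby node available")
```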
- A first exemplary embodiment of the present invention will be described with reference to
FIGS. 1 to 11. FIGS. 1 and 2 are diagrams for describing the configuration of the cluster system. FIGS. 3 to 11 are views for describing the operation of the cluster system. - As shown in
FIG. 1, the cluster system according to the present invention includes a plurality of node devices, shown as a node (1) 101, a node (2) 102, and a node (N) 103. The node devices 101 . . . are each configured by an information processing device such as a server computer. However, the node devices 101 . . . may each be configured by a virtually structured information processing device, as explained in a second exemplary embodiment described later. The number of the node devices 101 . . . configuring the cluster system according to the present invention is not limited to the number shown in FIG. 1. - The
node devices 101 . . . described above, in each of which an operating system (also referred to as "OS" hereinafter) is embedded, have service units 106 . . . for performing a predetermined service process provided to users and clusterwares 107 . . . for controlling the operation of the cluster system, respectively. The service units 106 . . . and the clusterwares 107 . . . are structured by embedding programs into arithmetic devices installed in the node devices 101 . . . , respectively. Hereinafter, the node devices 101 . . . will be shown and described with reference numeral 201 in FIG. 2. - Among the service units described above, the
service unit 106 included in the node device 101, which is the active system among the node devices 101 . . . configuring the cluster system, operates and provides a service process to a user. On the other hand, the service units included in the other node devices are in a standby state (see FIG. 1). Then, the clusterware 107 controls a process to start or stop the service units 106 . . . . Therefore, in a case that the service unit 106 is incapable of continuing to operate because of, for example, a fault of the node device 101, the clusterware 107 executes system switching, which is switching between the active system and the standby system regarding the node devices, and the service unit included in another one of the node devices restarts the service. - The
clusterware 107, which is denoted by reference numeral 203 in FIG. 2, includes a cluster controlling unit 205, a node managing unit 206, an operation status transmitting unit 207 and an operation status receiving unit 208, as shown in FIG. 2. The node managing unit 206 manages a node list A 209 for holding "identifiers," "addresses" and "operation statuses" of all of the nodes contained in the cluster system. The respective units 205 to 208 and the node list A 209 will be described in detail in the explanation of the operation later. - Further, as shown in
FIG. 1, the respective node devices 101 . . . are connected to a wired or wireless LAN (Local Area Network) (a first network). The respective node devices 101 . . . are enabled to perform communication with each other via the LAN and a network switch (1) 104 by the clusterware 107 operating on the operating system. - Furthermore, as shown in
FIG. 1, the node devices 101 . . . include baseboard management controllers 108 . . . , respectively. Each of the baseboard management controllers 108 . . . operates as firmware implemented on hardware such as a processor installed in each of the node devices 101 . . . , and operates independently of the aforementioned OS embedded in each of the node devices 101 . . . . Therefore, even when any of the node devices 101 . . . comes to a standstill, the baseboard management controller 204 installed in each of the node devices 101 . . . can keep operating. - The
baseboard management controllers 108 . . . installed in the node devices 101 . . . , respectively, are connected to a wired or wireless management LAN (a second network), and are capable of performing communication with each other via the management LAN and a network switch (2) 105. Because the network switch (1) 104 and the network switch (2) 105 are connected by a predetermined network, the clusterwares 107 . . . and the baseboard management controllers 108 . . . can also perform communication with each other. - In
FIG. 2, the baseboard management controllers 108 . . . are denoted by reference numeral 204. The baseboard management controller 204 includes a BMC node managing unit 210, a BMC operation status acquiring unit 211, a BMC controlling unit 212 and a hardware monitoring unit 213, as shown in FIG. 2. Then, the BMC node managing unit 210 manages a node list B 214 for holding "identifiers" and "addresses" of all of the nodes contained in the cluster system. The respective units 210 to 213 and the node list B 214 will be described in detail in the explanation of the operation later. - Next, the operation of the abovementioned node device 201 (each of the
node devices 101 . . . ) will be described with reference to FIGS. 2 to 11. - First, the
cluster controlling unit 205 requests the node managing unit 206 to start alive monitoring of the node devices, that is, to start detection of the operation statuses representing whether the own node and the other nodes are operating normally or are down (are not operating normally) (step S1 in FIG. 4). The cluster controlling unit 205 waits for notification of the operation statuses of the respective node devices (step S2 in FIG. 4). - Upon reception of the request for alive monitoring of the operation statuses from the
cluster controlling unit 205 as described above, the node managing unit 206 (a first node managing unit) requests the operation status transmitting unit 207 to notify the operation status "operating" of the own node (step S21 in FIG. 5). Then, based on the addresses of all of the nodes acquired from the node list A 209 via the node managing unit 206 (step S31 in FIG. 6), the operation status transmitting unit 207 notifies the operation status "operating" of the own node to all of the nodes (steps S32 and S33 in FIG. 6). The notification of the operation status by the operation status transmitting unit 207 is received by the operation status receiving unit 208 of each of the nodes, and the operation status receiving unit 208 notifies the received operation status of each of the nodes to the node managing unit 206 every time it receives the notification (steps S41 and S42 in FIG. 7). The node managing unit 206 receives the operation status of each of the nodes from the operation status receiving unit 208 (step S23 in FIG. 5), and holds it as the result of detection of the operation status of each of the nodes. - The
node managing unit 206 judges a node device that has not sent a notification to the operation status receiving unit 208 for a given time or more to be in the down state, and holds this as the result of detection of the operation status of that node device. Because this detection of the operation statuses of all of the nodes by the node managing unit 206 is executed on the OS, in a case that the OS of the own node device or of one of the other node devices is in a high-load condition, that node device cannot perform communication with the other nodes for a given time or more and is judged to be in the down state even though it is not actually down, as mentioned above. - Further, the
node managing unit 206 requests the BMC node managing unit 210 to acquire the operation statuses of all of the node devices, almost in tandem with the process of detection of the operation statuses of all of the nodes executed on the OS described above (step S22 in FIG. 5). Thus, the BMC node managing unit 210 (a second node managing unit) requests the BMC operation status acquiring unit 211 to acquire the operation statuses of all of the nodes (step S51 in FIG. 8). - Based on the addresses of all of the nodes acquired from the
node list B 214 via the BMC node managing unit 210 (step S61 in FIG. 9), the BMC operation status acquiring unit 211 acquires the operation status from the BMC operation status acquiring unit 211 of each of the nodes (steps S62 and S63 in FIG. 9). The acquired operation status result is notified to the BMC node managing unit 210 (step S64 in FIG. 9, step S52 in FIG. 8), and notified from the BMC node managing unit 210 to the node managing unit 206 (step S53 in FIG. 8). - Thus, by notification from the BMC
node managing unit 210, the node managing unit 206 accepts and holds, via the baseboard management controller 204, the result of detection of the operation statuses of all of the nodes obtained without being affected by the OS (step S24 in FIG. 5). Although the request from the node managing unit 206 to the BMC node managing unit 210 (step S22 in FIG. 5 described above) is made periodically, in a case that no request arrives for a given time or more, the BMC node managing unit 210 detects that the operation status of the own node is the down state. - Subsequently, based on the result of detection of the operation statuses of all of the node devices executed on the OS received from the operation
status receiving unit 208 as described above and the result of detection of the operation statuses of all of the node devices executed without being affected by the OS received from the BMC node managing unit 210, the node managing unit 206 (a node status judging unit) judges the operation status of each node device (step S25 in FIG. 5). To be specific, the node managing unit 206 judges a node device to be in the down state, that is, not actually operating normally, only when both detection results consider that node device to be in the down state. - Then, in a case that there is a node device judged to be in the down state ("Yes" at step S26 in
FIG. 5), the node managing unit 206 updates the operation status of the corresponding node in the node list A 209 (step S28 in FIG. 5). In a case that the service unit 202 is operating in the node device judged to be in the down state, the node managing unit 206 requests the cluster controlling unit 205 to execute a system switching process (step S29 in FIG. 5). After that, upon reception of the system switching request (step S3 in FIG. 4), the cluster controlling unit 205 executes the system switching process, and the node managing unit 206 stands by for a given time (step S27 in FIG. 5). - Next, an operation of monitoring the hardware of a node device by the baseboard management controller 204 (a second node managing unit) will be described. The
hardware monitoring unit 213 monitors the hardware of the own node (step S91 in FIG. 11) and, when detecting a fault, notifies the BMC controlling unit 212 ("Yes" at step S92 and step S94 in FIG. 11, step S71 in FIG. 10). The BMC controlling unit 212 judges whether there is a need to stop the node depending on the severity of the fault, and takes the following measures. - (1) When There Is a Need to Immediately Stop the Node ("Yes" at Step S72 and "Yes" at Step S74 in
FIG. 10 ) - The
BMC controlling unit 212 forcibly stops the node (step S81 inFIG. 10 ), and notifies stoppage of the node to thecluster controlling unit 205 of the other node (step S82 inFIG. 10 , step S93 inFIG. 11 ). In a case that theservice unit 202 of thenode 201 having been stopped has been operating, thecluster controlling unit 205 of thenode 201 having received the notification executes system switching. - (2) When There Is a Need to Stop the Node within a Predetermined Time (“Yes” at Step S72 in
FIG. 10 and "No" at Step S74 in FIG. 10) -
BMC controlling unit 212 gives an advance notice of stoppage to thecluster controlling unit 205 of the own node 201 (step S75 inFIG. 10 , arrow Y1 inFIG. 3 ). In a case that theservice unit 202 is not operating in the own node (“from own node” at step S3 and “No” at step S6 inFIG. 4 ), thecluster controlling unit 205 requests theBMC controlling unit 212 to stop the node (step 11 inFIG. 4 , step S76 and “Yes” at step S77 inFIG. 10 ), and theBMC controlling unit 212 stops the node 201 (step S81 inFIG. 10 ). - Further, in a case that the
service unit 202 is operating in the own node ("Yes" at step S6 in FIG. 4), the cluster controlling unit 205 requests the BMC controlling unit 212 to wait for completion of the system switching, in order to inhibit the stoppage process by the BMC controlling unit 212 ("No" at step S77 and step S78 in FIG. 10, step S7 in FIG. 4, arrow Y2 in FIG. 3). The cluster controlling unit 205 then executes the system switching (step S8 in FIG. 4). For example, in the example shown in FIG. 3, the cluster controlling unit 205 stops the service unit 106 operating in the node (1) denoted by reference numeral 101, and executes the system switching so that the service unit 109 can take over in the node (2) denoted by reference numeral 102 (arrows Y3 and Y4 in FIG. 3). - After completion of the system switching, in order to cancel the inhibition of the stoppage process by the
BMC controlling unit 212, the cluster controlling unit 205 notifies the completion of the system switching to the BMC controlling unit 212 (step S9 in FIG. 4, arrow Y5 in FIG. 3). The BMC controlling unit 212 having received the notification stops the node 201 ("No" at step S79 and step S80 in FIG. 10, arrow Y6 in FIG. 3). However, in a case that the system switching is not completed within a predetermined time ("Yes" at step S79 in FIG. 10), the BMC controlling unit 212 forcibly stops the node 201 (step S81 in FIG. 10), and notifies the stoppage of the node 201 to the cluster controlling unit 205 of the other node 201 (step S82 in FIG. 10). The cluster controlling unit 205 of the other node 201 having received the notification executes system switching (step S10, "from other node" at step S3, step S4, and step S5 in FIG. 4). - (3) When There Is No Need to Stop the Node ("No" at step S72 in
FIG. 10 ) - The
BMC controlling unit 212 takes measures for restoration of hardware in which a fault is caused (step S73 inFIG. 10 ). - Thus, in the cluster system according to the present invention, alive monitoring of nodes by the
baseboard management controller 108 is not affected by the operation status of the OS, so that, even when a node comes into a state incapable of performing communication with another node due to the operation status of the OS, it is possible to accurately grasp the operation status of that node. Therefore, it is possible to avoid erroneously judging that such a node device is in the down state, and it is possible to increase the reliability of the cluster system. - Further, the cluster system according to the present invention detects a hardware fault in the hardware monitoring by the
baseboard management controller 108, which is not affected by the operation status of the OS, and immediately notifies all of the nodes; hence, it can immediately execute system switching in a case that a node goes down due to a hardware fault. As a result, it is possible to increase the reliability of the cluster system. - Next, a second exemplary embodiment of the present invention will be described with reference to
FIGS. 12 and 13. As shown in FIGS. 12 and 13, it is possible to realize the cluster system according to the present invention in a virtual environment. - In a virtual environment, as shown in
FIG. 12, a plurality of nodes 1105 . . . operate within a virtual infrastructure (1) 1101, but it is enough to install only one baseboard management controller 1108. Each of the K nodes within a virtual infrastructure 1201 shown in FIG. 13 acquires the operation statuses of the other nodes via the same baseboard management controller 1205 without being affected by the OS. - A
node list A 1212 managed by a node managing unit 1209 has the same configuration as the node list described in the first exemplary embodiment, whereas a node list B 1217 managed by a BMC node managing unit 1213 holds the "addresses" of the virtual infrastructures and the "operation statuses of the nodes within each of the virtual infrastructures." Thus, it is possible to acquire the operation statuses of a plurality of nodes in bulk from one virtual infrastructure. - The whole or part of the exemplary embodiments disclosed above can be described as the following supplementary notes. Below, the outline of configurations of a cluster system (refer to
FIG. 14 ), a program and a node management method according to the present invention will be described. However, the present invention is not limited to the following configurations. - A cluster system comprising a plurality of node devices, wherein each of the
node devices 1 is connected with the other node devices by a first network 5 and a second network 6, and includes:
- a first node managing unit 2 configured to operate on an operating system embedded in an own device and detect operation statuses of the other node devices via the first network 5;
node managing unit 3 configured to operate without being affected by the operating system and detect operation statuses of the other node devices via thesecond network 6; and - a node
status judging unit 4 configured to judge whether each of the node devices is in a down state according to a preset standard, based on results of the detection of the other node devices by the firstnode managing unit 2 and the secondnode managing unit 3. - The cluster system according to
Supplementary Note 1, wherein the node status judging unit is configured to, in a case that both the first node managing unit and the second node managing unit detect that any of the node devices is in the down state according to the preset standard, judge the node device to be in the down state. - The cluster system according to
Supplementary Note 2, comprising a cluster controlling unit configured to, in a case that the node device judged to be in the down state by the node status judging unit is executing a preset process, execute a node switching process of switching so that another of the node devices executes the preset process. - The cluster system according to
Supplementary Note 3, wherein the second node managing unit is configured to operate without being affected by the operating system and monitor an operation status of hardware installed in the own device and, depending on a result of the monitoring, stop operation of the own device. - The cluster system according to
Supplementary Note 4, wherein: - the second node managing unit is configured to notify, to the cluster controlling unit, that the operation of the own device is due to be stopped based on the result of the monitoring;
- the cluster controlling unit is configured to receive notification that the operation of the own device is due to be stopped from the second node managing unit and, in a case that the own device is executing a preset process, execute the node switching process of switching so that another of the other node devices executes the process, and notify completion of the node switching process to the second node managing unit after the completion of the node switching process; and
- the second node managing unit is configured to stop the operation of the own device after receiving notification that the node switching process by the cluster controlling unit is completed.
- A program for causing each of a plurality of node devices configuring a cluster system including the plurality of node devices, to realize:
- a first node managing unit configured to operate on an operating system embedded in an own device and detect operation statuses of the other node devices via a first network connected to the other node devices;
- a second node managing unit configured to operate without being affected by the operating system and detect operation statuses of the other node devices via a second network connected to the other node device; and
- a node status judging unit configured to judge whether each of the node devices is in a down state according to a preset standard, based on results of the detection of the other node devices by the first node managing unit and the second node managing unit.
- The program according to
Supplementary Note 6, wherein the node status judging unit is configured to, in a case that both the first node managing unit and the second node managing unit detect that any of the node devices is in the down state according to the preset standard, judge the node device to be in the down state. - A node management method comprising, in a cluster system including a plurality of node devices:
- by a first node managing unit configured to operate on an operating system embedded in each of the node devices, detecting operation statuses of the other node devices via a first network connected with the other node devices;
- by a second node managing unit configured to operate without being affected by the operating system embedded in the node device, detecting operation statuses of the other node devices via a second network connected with the other node device; and
- judging whether each of the node devices is in a down state according to a preset standard, based on results of the detection of the other node devices by the first node managing unit and the second node managing unit.
- The node management method according to Supplementary Note 8, comprising:
- in a case that both the first node managing unit and the second node managing unit detect that any of the node devices is in the down state according to the preset standard, judging the node device to be in the down state.
- The program disclosed above is stored in a storage device, or recorded on a non-transitory computer-readable recording medium. For example, the non-transitory computer-readable recording medium is a portable medium such as a flexible disk, an optical disk, a magneto-optical disk and a semiconductor memory.
- Although the present invention has been described above with reference to the aforementioned exemplary embodiments, the present invention is not limited to the exemplary embodiments. The configurations and details of the present invention can be altered in various manners that can be understood by those skilled in the art within the scope of the present invention.
Claims (9)
1. A cluster system comprising a plurality of node devices, wherein the node devices are clustered with each other via at least one network, and include:
a controller configured to detect operation statuses of the other node devices, the controller being installed in the node device and managing the node device; and
a processor configured to detect operation statuses of the other node devices, and to determine whether one of the other node devices is in a down state according to a predefined rule based on results of the detection of the one of the other node devices by the controller and the processor.
2. The cluster system according to claim 1 , wherein the processor is configured to, in a case that both the controller and the processor detect that any of the node devices is in the down state, determine the node device to be in the down state.
3. The cluster system according to claim 2 , wherein the processor is configured to, in a case that the node device determined to be in the down state is executing a preset process, execute a node switching process of switching so that another of the node devices executes the preset process.
4. The cluster system according to claim 3 , wherein the controller is configured to operate without being affected by an operating system installed in the own device and monitor an operation status of hardware installed in the own device and, depending on a result of the monitoring, stop operation of the own device.
5. The cluster system according to claim 4 , wherein:
the controller is configured to notify, to the processor, that the operation of the own device is due to be stopped based on the result of the monitoring;
the processor is configured to receive notification that the operation of the own device is due to be stopped from the controller and, in a case that the own device is executing a preset process, execute the node switching process of switching so that another of the other node devices executes the process, and notify completion of the node switching process to the controller after the completion of the node switching process; and
the controller is configured to stop the operation of the own device after receiving notification that the node switching process by the processor is completed.
6. A non-transitory computer-readable storage medium storing a program comprising instructions for causing a plurality of node devices configuring a cluster system including the plurality of node devices clustered with each other via at least one network, to realize:
a controller configured to detect operation statuses of the other node devices, the controller being installed in the node device and managing the node device; and
a processor configured to detect operation statuses of the other node devices, and to determine whether one of the other node devices is in a down state according to a predefined rule based on results of the detection of the one of the other node devices by the controller and the processor.
7. The non-transitory computer-readable storage medium storing the program according to claim 6 , wherein the processor is configured to, in a case that both the controller and the processor detect that any of the node devices is in the down state, determine the node device to be in the down state.
8. A node management method comprising, in a cluster system including a plurality of node devices clustered with each other via at least one network:
by a controller, detecting operation statuses of the other node devices, the controller being installed in the node device and managing the node device; and
by a processor, detecting operation statuses of the other node devices, and determining whether one of the other node devices is in a down state according to a predefined rule based on results of the detection of the one of the other node devices by the controller and the processor.
9. The node management method according to claim 8 , comprising:
in a case that both the controller and the processor detect that any of the node devices is in the down state, determining the node device to be in the down state.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US14/879,253 US20160036654A1 (en) | 2012-03-09 | 2015-10-09 | Cluster system |
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2012052640A JP6007522B2 (en) | 2012-03-09 | 2012-03-09 | Cluster system |
JP2012-052640 | 2012-03-09 | ||
US13/748,189 US9210059B2 (en) | 2012-03-09 | 2013-01-23 | Cluster system |
US14/879,253 US20160036654A1 (en) | 2012-03-09 | 2015-10-09 | Cluster system |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US13/748,189 Continuation US9210059B2 (en) | 2012-03-09 | 2013-01-23 | Cluster system |
Publications (1)
Publication Number | Publication Date |
---|---|
US20160036654A1 true US20160036654A1 (en) | 2016-02-04 |
Family
ID=47747342
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US13/748,189 Expired - Fee Related US9210059B2 (en) | 2012-03-09 | 2013-01-23 | Cluster system |
US14/879,253 Abandoned US20160036654A1 (en) | 2012-03-09 | 2015-10-09 | Cluster system |
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US13/748,189 Expired - Fee Related US9210059B2 (en) | 2012-03-09 | 2013-01-23 | Cluster system |
Country Status (6)
Country | Link |
---|---|
US (2) | US9210059B2 (en) |
EP (1) | EP2637102B1 (en) |
JP (1) | JP6007522B2 (en) |
CN (1) | CN103312767A (en) |
BR (1) | BR102013005401A2 (en) |
IN (1) | IN2013CH00960A (en) |
Families Citing this family (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR102170720B1 (en) * | 2013-10-30 | 2020-10-27 | 삼성에스디에스 주식회사 | Apparatus and Method for Changing Status of Clustered Nodes, and recording medium recording the program thereof |
CN105681070A (en) * | 2014-11-21 | 2016-06-15 | 中芯国际集成电路制造(天津)有限公司 | Method and system for automatically collecting and analyzing computer cluster node information |
CN105988908B (en) * | 2015-02-04 | 2018-11-06 | 昆达电脑科技(昆山)有限公司 | The global data processing system of single BMC multiservers |
JP6424134B2 (en) * | 2015-04-23 | 2018-11-14 | 株式会社日立製作所 | Computer system and computer system control method |
US10157115B2 (en) * | 2015-09-23 | 2018-12-18 | Cloud Network Technology Singapore Pte. Ltd. | Detection system and method for baseboard management controller |
CN107025151A (en) * | 2016-01-30 | 2017-08-08 | 鸿富锦精密工业(深圳)有限公司 | Electronic installation connects system |
JP6838334B2 (en) * | 2016-09-26 | 2021-03-03 | 日本電気株式会社 | Cluster system, server, server operation method, and program |
CN107247564B (en) * | 2017-07-17 | 2021-02-02 | 苏州浪潮智能科技有限公司 | Method and system for data processing |
CN114218004A (en) * | 2021-12-15 | 2022-03-22 | 上海道客网络科技有限公司 | Method and system for fault handling of physical nodes of Kubernetes cluster based on BMC |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050251567A1 (en) * | 2004-04-15 | 2005-11-10 | Raytheon Company | System and method for cluster management based on HPC architecture |
US20060232575A1 (en) * | 2003-09-25 | 2006-10-19 | Nielsen Christen V | Methods and apparatus to detect an operating state of a display based on visible light |
US20100185894A1 (en) * | 2009-01-20 | 2010-07-22 | International Business Machines Corporation | Software application cluster layout pattern |
US20110131318A1 (en) * | 2009-05-26 | 2011-06-02 | Oracle International Corporation | High availability enabler |
Family Cites Families (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPS6277656A (en) * | 1985-05-22 | 1987-04-09 | Nec Corp | Program debugging system |
JPH08185379A (en) * | 1994-12-29 | 1996-07-16 | Nec Corp | Parallel processing system |
US5805785A (en) * | 1996-02-27 | 1998-09-08 | International Business Machines Corporation | Method for monitoring and recovery of subsystems in a distributed/clustered system |
US6212573B1 (en) * | 1996-06-26 | 2001-04-03 | Sun Microsystems, Inc. | Mechanism for invoking and servicing multiplexed messages with low context switching overhead |
US6308282B1 (en) * | 1998-11-10 | 2001-10-23 | Honeywell International Inc. | Apparatus and methods for providing fault tolerance of networks and network interface cards |
US6581166B1 (en) * | 1999-03-02 | 2003-06-17 | The Foxboro Company | Network fault detection and recovery |
US6862613B1 (en) * | 2000-01-10 | 2005-03-01 | Sun Microsystems, Inc. | Method and apparatus for managing operations of clustered computer systems |
US7627694B2 (en) * | 2000-03-16 | 2009-12-01 | Silicon Graphics, Inc. | Maintaining process group membership for node clusters in high availability computing systems |
US7149918B2 (en) * | 2003-03-19 | 2006-12-12 | Lucent Technologies Inc. | Method and apparatus for high availability distributed processing across independent networked computer fault groups |
JP4339763B2 (en) | 2004-09-07 | 2009-10-07 | 株式会社日立製作所 | Failover method and computer system |
JP4246248B2 (en) * | 2005-11-11 | 2009-04-02 | 富士通株式会社 | Network monitor program, information processing method, and computer executed in cluster system computer |
JP2008152552A (en) * | 2006-12-18 | 2008-07-03 | Hitachi Ltd | Computer system and failure information management method |
US7850260B2 (en) * | 2007-06-22 | 2010-12-14 | Oracle America, Inc. | Injection/ejection mechanism |
CN101594383B (en) * | 2009-07-09 | 2012-05-23 | 浪潮电子信息产业股份有限公司 | Method for monitoring service and status of controllers of double-controller storage system |
JP2011191854A (en) * | 2010-03-12 | 2011-09-29 | Hitachi Ltd | Computer system, control method of computer system, and program |
CN102137017B (en) * | 2011-03-17 | 2013-10-09 | 华为技术有限公司 | Working method and device for virtual network unit |
CN102231681B (en) * | 2011-06-27 | 2014-07-30 | 中国建设银行股份有限公司 | High availability cluster computer system and fault treatment method thereof |
US9100320B2 (en) * | 2011-12-30 | 2015-08-04 | Bmc Software, Inc. | Monitoring network performance remotely |
- 2012
  - 2012-03-09 JP JP2012052640A patent/JP6007522B2/en not_active Expired - Fee Related
- 2013
  - 2013-01-22 EP EP13152199.9A patent/EP2637102B1/en not_active Not-in-force
  - 2013-01-23 US US13/748,189 patent/US9210059B2/en not_active Expired - Fee Related
  - 2013-03-06 IN IN960CH2013 patent/IN2013CH00960A/en unknown
  - 2013-03-06 BR BR102013005401A patent/BR102013005401A2/en not_active Application Discontinuation
  - 2013-03-07 CN CN2013100731727A patent/CN103312767A/en active Pending
- 2015
  - 2015-10-09 US US14/879,253 patent/US20160036654A1/en not_active Abandoned
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060232575A1 (en) * | 2003-09-25 | 2006-10-19 | Nielsen Christen V | Methods and apparatus to detect an operating state of a display based on visible light |
US20050251567A1 (en) * | 2004-04-15 | 2005-11-10 | Raytheon Company | System and method for cluster management based on HPC architecture |
US20100185894A1 (en) * | 2009-01-20 | 2010-07-22 | International Business Machines Corporation | Software application cluster layout pattern |
US20110131318A1 (en) * | 2009-05-26 | 2011-06-02 | Oracle International Corporation | High availability enabler |
Non-Patent Citations (1)
Title |
---|
Landry, Pre-Grant Publication No. US 2011/0154005 A1 *
Also Published As
Publication number | Publication date |
---|---|
CN103312767A (en) | 2013-09-18 |
IN2013CH00960A (en) | 2015-08-14 |
EP2637102B1 (en) | 2020-06-17 |
JP6007522B2 (en) | 2016-10-12 |
US20130238787A1 (en) | 2013-09-12 |
JP2013186781A (en) | 2013-09-19 |
US9210059B2 (en) | 2015-12-08 |
EP2637102A1 (en) | 2013-09-11 |
BR102013005401A2 (en) | 2017-06-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US9210059B2 (en) | Cluster system | |
CN107526659B (en) | Method and apparatus for failover | |
JP5851503B2 (en) | Providing high availability for applications in highly available virtual machine environments | |
EP3142011B9 (en) | Anomaly recovery method for virtual machine in distributed environment | |
US9189316B2 (en) | Managing failover in clustered systems, after determining that a node has authority to make a decision on behalf of a sub-cluster | |
US8065560B1 (en) | Method and apparatus for achieving high availability for applications and optimizing power consumption within a datacenter | |
US10924538B2 (en) | Systems and methods of monitoring software application processes | |
US11223515B2 (en) | Cluster system, cluster system control method, server device, control method, and non-transitory computer-readable medium storing program | |
US20170228250A1 (en) | Virtual machine service availability | |
US9049101B2 (en) | Cluster monitor, method for monitoring a cluster, and computer-readable recording medium | |
CN113254245A (en) | Fault detection method and system for storage cluster | |
CN107071189B (en) | Connection method of communication equipment physical interface | |
CN111897681A (en) | Message forwarding method and device, computing equipment and storage medium | |
WO2014050493A1 (en) | Backup device, main device, redundancy configuration system, and load dispersion method | |
TWM432075U (en) | Monitoring device and monitoring system applicable to cloud algorithm | |
US11954509B2 (en) | Service continuation system and service continuation method between active and standby virtual servers | |
KR20140140719A (en) | Apparatus and system for synchronizing virtual machine and method for handling fault using the same | |
JP2018056633A (en) | Cluster system, server, operation method for server, and program | |
CN117201507A (en) | Cloud platform switching method and device, electronic equipment and storage medium | |
JP2013025765A (en) | Master/slave system, control device, master/slave switching method and master/slave switching program | |
US11010269B2 (en) | Distributed processing system and method for management of distributed processing system | |
US20230289203A1 (en) | Server maintenance control device, server maintenance system, server maintenance control method, and program | |
CN107783855B (en) | Fault self-healing control device and method for virtual network element | |
JP2013156963A (en) | Control program, control method, information processing apparatus, and control system | |
US11150980B2 (en) | Node device, recovery operation control method, and non-transitory computer readable medium storing recovery operation control program |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
STCB | Information on status: application discontinuation |
Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |