This repository was archived by the owner on Feb 13, 2025. It is now read-only.

Conversation

@svagner
Contributor

@svagner svagner commented Apr 23, 2020

Description

Cluster implementation for bosun.
The goal of this change is to add high availability for the bosun service. Currently, it’s possible to have only one active bosun node running checks and sending alerts. If this node goes down or becomes unavailable, there’s a high chance that checks won’t run for some time, alerts are not fired, incidents aren’t created, and notifications aren’t delivered. It then takes some time for a sysadmin to switch all the checks to another node. So, in the end, we’re minimizing the human intervention needed in failure scenarios and reducing possible service downtime.

Changes overview
We've implemented clustering for bosun.
The cluster has only one ‘leader’ at a time; all other nodes are followers (an implementation of a model with one master and multiple standby nodes).
The ‘leader’ node executes the checks and sends notifications, while ‘follower’ nodes respond to web UI and API queries. A ‘follower’ node can send notifications, but only those generated from API/UI queries.
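For illustration only, here is a minimal sketch of how check execution might be gated on leadership, assuming the cluster is built on hashicorp/raft as the commits suggest; `runChecks` and `executeChecks` are hypothetical names, not code from this PR:

```go
package sched

import (
	"log"
	"time"

	"github.com/hashicorp/raft"
)

// runChecks is a hypothetical scheduler loop: only the current raft
// leader executes checks, while followers keep serving UI/API queries.
func runChecks(r *raft.Raft, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for range ticker.C {
		if r.State() != raft.Leader {
			// Follower/candidate: skip check execution entirely.
			continue
		}
		// Leader: run the scheduled checks and send notifications.
		log.Println("leader: running checks")
		// executeChecks() // hypothetical hook into bosun's scheduler
	}
}
```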

As a replacement for #2345 and #2441

Fixes #2443

Type of change

From the following, please check the options that are relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How has this been tested?

  • TestAlertRunner_ClusterLeader
  • TestAlertRunner_ClusterFollower
  • TestCheckNotify_Cluster_FollowerState
  • TestCheckNotify_Cluster_LeaderState
  • TestClusterEnabledFollover_AlertRun

Checklist:

  • This contribution follows the project's code of conduct
  • This contribution follows the project's contributing guidelines
  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules

svagner added 26 commits April 23, 2020 23:31
The cluster has only one ‘leader’ at a time; all other nodes are followers (an implementation of a model with one master and multiple standby nodes).
The ‘master’ node executes the checks and sends notifications; ‘follower’ nodes do neither (they run with the ‘no-checks’ and ‘quiet-mode’ options enabled). This also adds a new (optional) dependency, raftdb, to store state and perform leader election.
For now, we look at a global variable that was initialized once at startup. If we want the flexibility to restart the scheduler (config API reload, clustering, etc.), we should instead use the scheduler's start time.
- check raft state within the scheduler to prevent losing events (see the illustrative sketch after this commit list)
- sync rules configuration between nodes within the cluster
- add snapshot/log management within the raft cluster
- redis latency/count/errors
- notifications latency/count/errors
- incident state change
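As an illustration of the leadership handling described in the commits above, here is a hedged sketch, again assuming hashicorp/raft; `watchLeadership`, `start`, and `stop` are hypothetical names, not code from this PR:

```go
package sched

import (
	"context"
	"log"

	"github.com/hashicorp/raft"
)

// watchLeadership is a hypothetical helper that starts the check runner
// when this node becomes the raft leader and stops it when leadership
// is lost, so checks are not executed on a follower.
func watchLeadership(ctx context.Context, r *raft.Raft, start, stop func()) {
	for {
		select {
		case <-ctx.Done():
			return
		case isLeader := <-r.LeaderCh():
			if isLeader {
				log.Println("became leader: starting check runner")
				start()
			} else {
				log.Println("lost leadership: stopping check runner")
				stop()
			}
		}
	}
}
```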
@stale

stale bot commented Apr 18, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Apr 18, 2021
@stale stale bot closed this May 19, 2021