Short description
We should be able to configure how many failed alert executions Bosun tolerates before changing the alert state to unknown.
Currently, if Bosun fails to execute an alert because of a one-time problem (e.g. Redis was unavailable while it was being failed over to another node), Bosun immediately starts generating unknown incidents.
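One possible shape for such a setting, sketched as a Bosun-style alert definition. The `unknownSkipCount` key is hypothetical (it does not exist in Bosun today), and the `crit` expression is only illustrative:

```
alert haproxy_backend {
    # illustrative expression only
    crit = avg(q("sum:haproxy.backend.up{host=*}", "5m", "")) < 1
    # existing setting: how long before results are considered unknown
    unknown = 5m
    # hypothetical new setting: tolerate this many consecutive failed
    # executions before creating unknown incidents
    unknownSkipCount = 3
}
```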
Example of such behavior:
```
2020/04/14 13:57:36 info: check.go:590: check alert haproxy_backend start with now set to 2020-04-14 11:57:36.04166142
2020/04/14 13:57:36 info: check.go:664: check alert haproxy_backend done (646.941211ms): 105 untouched_sum, 0 crits, 0 warns, 0 unevaluated, 0 unknown, 0 untouched/unknown, 0 errors
2020/04/14 13:57:36 info: alertRunner.go:119: runHistory on haproxy_backend took 32.316722ms
2020/04/14 14:02:37 info: check.go:590: check alert haproxy_backend start with now set to 2020-04-14 12:02:37.371733479
2020/04/14 14:02:38 info: check.go:664: check alert haproxy_backend done (827.396128ms): 105 untouched_sum, 0 crits, 0 warns, 0 unevaluated, 0 unknown, 0 untouched/unknown, 0 errors
2020/04/14 14:02:38 error: check.go:96: Error in runHistory for haproxy_backend{host=host1}. state_data.go:187: NOREPLICAS Not enough good slaves to write..
... errors while failover ...
2020/04/14 14:07:38 info: check.go:590: check alert haproxy_backend start with now set to 2020-04-14 12:07:38.492328499
2020/04/14 14:07:39 info: check.go:664: check alert haproxy_backend done (732.655348ms): 105 untouched_sum, 0 crits, 0 warns, 0 unevaluated, 0 unknown, 105 untouched/unknown, 0 errors
```
P.S. The extra log data comes from #2415.
How this feature will help you/your organization
This feature would help avoid dozens of unknown notifications after one-time problems on the Bosun side.
Possible solution or implementation details
#2471