Issue: Inconsistent Cluster Health Metrics (minio_cluster_health_nodes_online_count v3 vs. minio_cluster_nodes_offline_total v2) #21577

nitindhiman314e · 2025-09-09T07:13:01Z

Hi Team.

When monitoring my MinIO cluster, I see that:

minio_cluster_health_nodes_online_count (v3 metric) intermittently reports nodes as not healthy.
At the same time, minio_cluster_nodes_offline_total (v2 metric) shows the expected count of offline servers.
All other cluster health, disk, and node metrics (including disk online/offline counts and other node metrics) appear healthy and show up normally.

There are no disk, I/O, mount, or hardware errors in the underlying OS logs (checked via dmesg/syslog).

Steps to reproduce:

Observe minio_cluster_health_nodes_online_count and minio_cluster_nodes_offline_total in Grafana/VictoriaMetrics.
Notice that v3 reports “not healthy” spikes at the same times v2 reports servers “offline”, while other signals (disk/node metrics and application logs) report healthy states.
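
For anyone trying to reproduce this outside Grafana, here is a minimal Python sketch that cross-checks the two metrics from raw scrape output. It assumes the text has already been fetched from the v2 and v3 metrics endpoints and that the expected node count is known. The `server` label in the samples, the helper names, and the choice to take the worst value across series are illustrative assumptions, not something taken from MinIO itself.

```python
V2_OFFLINE = "minio_cluster_nodes_offline_total"
V3_ONLINE = "minio_cluster_health_nodes_online_count"


def parse_samples(text):
    """Parse Prometheus text exposition lines into {metric_name: [values]}."""
    samples = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            # name{labels} value [timestamp]; label values may contain spaces
            name = line[: line.index("{")]
            fields = line[line.rindex("}") + 1:].split()
        else:
            name, *fields = line.split()
        if not fields:
            continue
        samples.setdefault(name, []).append(float(fields[0]))
    return samples


def cross_check(v2_text, v3_text, expected_nodes):
    """Compare v2 offline and v3 online counts taken from one scrape each."""
    v2 = parse_samples(v2_text).get(V2_OFFLINE)
    v3 = parse_samples(v3_text).get(V3_ONLINE)
    if not v2 or not v3:
        return {"v2_offline": None, "v3_online": None, "agree": False}
    offline = max(v2)  # assumption: take the worst value if several series exist
    online = min(v3)
    return {
        "v2_offline": offline,
        "v3_online": online,
        "agree": offline + online == expected_nodes,
    }


if __name__ == "__main__":
    # Hypothetical scrape output for a 4-node cluster.
    v2_scrape = 'minio_cluster_nodes_offline_total{server="node1:9000"} 1\n'
    v3_scrape = 'minio_cluster_health_nodes_online_count{server="node1:9000"} 3\n'
    print(cross_check(v2_scrape, v3_scrape, expected_nodes=4))
```

If `agree` is true during a spike, both endpoints are describing the same event and the question becomes why a node was briefly seen as offline; if it is false, the two endpoints disagree with each other for that scrape.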

nitindhiman314e · 2025-09-12T05:18:44Z

Hi Team, please find additional information below:

minio version

RELEASE.2025-07-23T15-54-02Z (commit-id=7ced9663e6a791fef9dc6be798ff24cda9c730ac)
Runtime: go1.24.5 linux/amd64
License: GNU AGPLv3 - https://www.gnu.org/licenses/agpl-3.0.html
Copyright: 2015-2025 MinIO, Inc.

Operating System and Version:
Linux 6.8.0-78-generic -Ubuntu SMP PREEMPT_DYNAMIC Tue Aug 12 11:34:18 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux

Context

This issue intermittently affects our MinIO cluster monitoring and alerting. We are tracking cluster health using Grafana with VictoriaMetrics and MinIO’s v2/v3 metrics endpoints. During two short intervals,

minio_cluster_health_nodes_online_count (v3) shows the cluster as not healthy,

minio_cluster_nodes_offline_total (v2) registers servers offline,

while other health/disk/node-related metrics display normal “healthy” status, and there are no disk errors in OS logs (dmesg, syslog). This impacts our ability to reliably alert on true disk or node failure and causes confusion about cluster state. We need to understand why these metrics briefly become inconsistent and how to distinguish genuine cluster faults from transient metric discrepancies.
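
One possible way to separate genuine faults from brief blips, pending an explanation of the root cause, is to alert only when the offline count stays above zero for several consecutive scrapes. The Python sketch below illustrates that idea on a plain list of sampled values. The function names and the threshold of 3 samples are arbitrary assumptions for illustration and should be tuned to the scrape interval in use.

```python
def longest_offline_run(offline_counts):
    """Return the length of the longest run of consecutive samples with offline > 0."""
    longest = current = 0
    for count in offline_counts:
        current = current + 1 if count > 0 else 0
        longest = max(longest, current)
    return longest


def is_sustained_outage(offline_counts, min_consecutive=3):
    """True only if the offline count stayed above zero for min_consecutive samples in a row."""
    return longest_offline_run(offline_counts) >= min_consecutive


if __name__ == "__main__":
    # Hypothetical series of minio_cluster_nodes_offline_total values, one per scrape.
    blips = [0, 1, 0, 0, 1, 0]
    outage = [0, 1, 1, 1, 1, 0]
    print(is_sustained_outage(blips))
    print(is_sustained_outage(outage))
```

The same debounce is usually expressed directly in alerting rules with a `for:` duration rather than in custom code; the sketch only makes the logic explicit.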
