Issue: Inconsistent Cluster Health Metrics (minio_cluster_health_nodes_online_count v3 vs. minio_cluster_nodes_offline_total v2) #21577
Hi Team, please use the additional info below:
RELEASE.2025-07-23T15-54-02Z (commit-id=7ced9663e6a791fef9dc6be798ff24cda9c730ac)
This issue intermittently affects our MinIO cluster monitoring and alerting. We track cluster health in Grafana with VictoriaMetrics, scraping MinIO’s v2 and v3 metrics endpoints. During two short intervals, minio_cluster_health_nodes_online_count (v3) showed the cluster as not healthy and minio_cluster_nodes_offline_total (v2) registered servers offline, while the other health, disk, and node metrics reported a normal “healthy” status and the OS logs (dmesg, syslog) showed no disk errors. This undermines our ability to alert reliably on true disk or node failures and causes confusion about cluster state. We need to understand why these metrics briefly become inconsistent and how to distinguish genuine cluster faults from transient metric discrepancies.
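Until the root cause is known, one way to separate a genuine fault from a short blip on the alerting side is to require the offline condition to persist for a minimum duration. The following is only a minimal sketch under stated assumptions: the samples are (timestamp, offline node count) pairs already exported from the time-series database, and the function name `sustained_outages` and the threshold are hypothetical, not part of MinIO or VictoriaMetrics.

```python
from typing import Iterable, List, Tuple

# Assumed input shape: (unix timestamp in seconds, offline node count).
Sample = Tuple[float, float]


def sustained_outages(
    samples: Iterable[Sample], min_duration_s: float
) -> List[Tuple[float, float]]:
    """Return (start, end) intervals where the offline count stayed
    above zero for at least min_duration_s seconds."""
    outages: List[Tuple[float, float]] = []
    start = None  # timestamp of the first offline sample in the current run
    last = None   # timestamp of the latest offline sample in the current run
    for ts, offline in sorted(samples):
        if offline > 0:
            if start is None:
                start = ts
            last = ts
        else:
            # The run ended; keep it only if it lasted long enough.
            if start is not None and last - start >= min_duration_s:
                outages.append((start, last))
            start = None
    # Handle a run that is still open at the end of the data.
    if start is not None and last - start >= min_duration_s:
        outages.append((start, last))
    return outages
```

With a 15-second scrape interval, a single offline sample yields a run of zero length and is dropped, while three consecutive offline samples span 30 seconds and are reported. The right threshold depends on the scrape interval and on how quickly a real node failure must be detected.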
Hi Team.
When monitoring my MinIO cluster, I see that:
minio_cluster_health_nodes_online_count (v3 metric) intermittently reports nodes as not healthy.
At the same time, minio_cluster_nodes_offline_total (v2 metric) shows the expected count of offline servers.
All other cluster health, disk, and node metrics (including disk online/offline counts and other node metrics) appear healthy and show up normally.
There are no disk, IO, mount, or hardware errors in the underlying OS logs (checked via dmesg/syslog).
Steps to reproduce:
Observe minio_cluster_health_nodes_online_count and minio_cluster_nodes_offline_total in Grafana/VictoriaMetrics.
Notice that the v3 metric spikes to “not healthy” at the same times the v2 metric reports servers “offline”, while the other metrics (disk/node) and the application logs report healthy states.
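To pin down exactly which scrapes disagree, the two series can be compared offline. This is a rough sketch, assuming each series has been exported as a mapping from scrape timestamp to value; the function name `disagreeing_timestamps` and the choice of a drive-offline series as the "other" metric are illustrative assumptions, not something taken from the MinIO metrics documentation.

```python
from typing import Dict, List


def disagreeing_timestamps(
    nodes_offline: Dict[int, float], drives_offline: Dict[int, float]
) -> List[int]:
    """Return timestamps where the node metric reports offline servers
    while the drive metric reports no offline drives."""
    # Only compare timestamps present in both series.
    shared = sorted(nodes_offline.keys() & drives_offline.keys())
    return [
        ts
        for ts in shared
        if nodes_offline[ts] > 0 and drives_offline[ts] == 0
    ]
```

The returned timestamps can then be lined up against the MinIO server logs and network logs for those exact moments, which is usually more informative than looking at the Grafana panel alone.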