Description
Before reporting an issue
- I have read and understood the above terms for submitting issues, and I understand that my issue may be closed without action if I do not follow them.
Area
infinispan
Describe the bug
We are running Keycloak in Docker containers on Linux VMs in cluster mode.
During our switch from TCP_PING to JDBC_PING, we started experiencing intermittent issues during deployments.
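For context, this is roughly how we enable JDBC_PING; a minimal sketch assuming the built-in jdbc-ping cache stack, with the database connection coming from our usual KC_DB_* settings (not shown here):

```
# Docker environment (sketch, not our full configuration):
# enable the distributed Infinispan cache and use the jdbc-ping stack,
# so JGroups discovery entries are stored in the Keycloak database.
KC_CACHE=ispn
KC_CACHE_STACK=jdbc-ping
```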
After some troubleshooting, we discovered that Infinispan occasionally encounters problems when forming a cluster. These issues typically occur when one or more cluster members are redeployed. The symptoms are not consistent — they vary from one deployment to another.
Example scenario:
We have four Keycloak instances expected to form a cluster: keycloak-1, keycloak-2, keycloak-3, and keycloak-4.
At the time of restart, keycloak-1 is the coordinator.
If we restart keycloak-2, we observe the following log on keycloak-1:
2025-10-10 14:02:21.584 stdout 2025-10-10 14:02:21,583 TRACE [org.jgroups.blocks.cs.NioServer] (NioServer.Selector [/0.0.0.0:57800]-3,keycloak-1-56712) keycloak-1-ip-address:57800: removed connection to keycloak-2-ip-address:57800
On keycloak-2 (the restarted instance), we can see that it becomes a singleton node:
2025-10-10 14:02:53.879 stdout 2025-10-10 14:02:53,878 WARN [org.jgroups.protocols.pbcast.GMS] (main) keycloak-1-899: too many JOIN attempts (10): becoming singleton
2025-10-10 14:02:53.880 stdout 2025-10-10 14:02:53,880 DEBUG [org.jgroups.protocols.pbcast.NAKACK2] (main)
2025-10-10 14:02:53.880 stdout [keycloak-1-899 setDigest()]
2025-10-10 14:02:53.880 stdout existing digest: []
2025-10-10 14:02:53.880 stdout new digest: keycloak-1-899: [0 (0)]
2025-10-10 14:02:53.880 stdout resulting digest: keycloak-1-899: [0 (0)]
For some unknown reason, keycloak-3 now becomes the new coordinator:
2025-10-10 14:02:53.704 stdout 2025-10-10 14:02:53,704 DEBUG [org.jgroups.protocols.pbcast.GMS] (VERIFY_SUSPECT2.Runner-1) keycloak-3-26562: members are (4) keycloak-1-56712,keycloak-3,keycloak-2-43087,keycloak-1-29717, coord=keycloak-3-26562: I'm the new coordinator
A few seconds later, Infinispan fails to recover the cluster state on keycloak-3:
2025-10-10 14:02:59.782 stdout 2025-10-10 14:02:59,782 WARN [org.infinispan.topology.ClusterTopologyManagerImpl] (timeout-thread--p4-t1) ISPN000196: Failed to recover cluster state after the current node became the coordinator (or after merge), will retry: org.infinispan.commons.TimeoutException: ISPN000476: Timed out waiting for responses for request 96 from keycloak-1-29717 after 6 seconds
These are the last logs on the initial coordinator (keycloak-1):
2025-10-10 14:02:11.755 stdout 2025-10-10 14:02:11,755 INFO [org.infinispan.CLUSTER] () [Context=offlineClientSessions] ISPN100002: Starting rebalance with members [keycloak-1-56712, keycloak-3-26562, keycloak-4-43087], phase READ_OLD_WRITE_ALL, topology id 15
2025-10-10 14:02:11.757 stdout 2025-10-10 14:02:11,757 TRACE [org.jgroups.protocols.pbcast.NAKACK2] () keycloak-1-56712 --> [all]: #177
2025-10-10 14:02:11.757 stdout 2025-10-10 14:02:11,757 TRACE [org.jgroups.protocols.TCP] () keycloak-1-56712: sending msg to null, src=keycloak-1-56712, size=4522, headers are NAKACK2: [MSG, seqno=177], TP: [cluster=ISPN]
2025-10-10 14:02:11.757 stdout 2025-10-10 14:02:11,757 TRACE [org.jgroups.protocols.MFC] () keycloak-1-56712 used 4474 credits, 3391245 remaining
2025-10-10 14:02:21.584 stdout 2025-10-10 14:02:21,583 TRACE [org.jgroups.blocks.cs.NioServer] (NioServer.Selector [/0.0.0.0:57800]-3,keycloak-1-56712) 10.41.1.252:57800: removed connection to keycloak-2-ip-address:57800
2025-10-10 14:02:21.590 stdout 2025-10-10 14:02:21,589 TRACE [org.jgroups.protocols.FD_SOCK2] (NioServer.Selector [/0.0.0.0:57800]-3,keycloak-1-56712) keycloak-1-56712: CONNECT <-- keycloak-4-43087
2025-10-10 14:02:21.590 stdout 2025-10-10 14:02:21,590 TRACE [org.jgroups.protocols.FD_SOCK2] (NioServer.Selector [/0.0.0.0:57800]-3,keycloak-1-56712) keycloak-1-56712: CONNECT-RSP[cluster=ISPN, srv=keycloak-1-56712] --> keycloak-4-43087
End result:
Both keycloak-1 and keycloak-2 become unresponsive.
The JGROUPS_PING table shows an invalid state: for example, keycloak-1 remains listed even though it is unresponsive (see the query sketch below).
The state never changes, and the cluster remains stuck in this condition.
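For reference, this is how we inspect the discovery table when the cluster gets stuck; a minimal sketch assuming the default JGROUPS_PING table name from the jdbc-ping stack, the exact column set depends on the JDBC_PING version in use:

```sql
-- Sketch: list the discovery entries written for the ISPN cluster.
-- Table name assumes the default used by the built-in jdbc-ping stack;
-- adjust to your schema if it differs.
SELECT * FROM JGROUPS_PING;

-- After the failure, the row for the unresponsive node (keycloak-1)
-- is still present and is never cleaned up.
```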
Version
26.3.3
Regression
- The issue is a regression
Expected behavior
The cluster should recover on its own, or at least not end up in this stuck state, when a cluster member is redeployed.
Actual behavior
As described above: after the redeploy, keycloak-1 and keycloak-2 become unresponsive and the cluster remains stuck in this state.
How to Reproduce?
See the example scenario above: run four Keycloak instances clustered via JDBC_PING and redeploy one member. The issue occurs intermittently, not on every redeploy; a rough sketch of the triggering step follows.
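A sketch of the redeploy step, assuming a plain container restart; our actual pipeline redeploys the container, but the effect on the cluster is the same. Container names are placeholders for our deployment:

```sh
# With keycloak-1..keycloak-4 clustered via jdbc-ping and keycloak-1 as
# coordinator, restart a non-coordinator member and watch the coordinator's
# logs for the connection-removal and failed-JOIN messages shown above.
docker restart keycloak-2
```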
Anything else?
No response