
Keycloak 26.3.3 - Infinispan cluster issues #43367

@sstojak1

Before reporting an issue

  • I have read and understood the above terms for submitting issues, and I understand that my issue may be closed without action if I do not follow them.

Area

infinispan

Describe the bug

We are running Keycloak in Docker containers on Linux VMs in cluster mode.
During our switch from TCP_PING to JDBC_PING, we started experiencing intermittent issues during deployments.

After some troubleshooting, we discovered that Infinispan occasionally encounters problems when forming a cluster. These issues typically occur when one or more cluster members are redeployed. The symptoms are not consistent — they vary from one deployment to another.

Example scenario:
We have four Keycloak instances expected to form a cluster: keycloak-1, keycloak-2, keycloak-3, and keycloak-4.
At the time of restart, keycloak-1 is the coordinator.

If we restart keycloak-2, we observe the following log on keycloak-1:

2025-10-10 14:02:21.584 stdout 2025-10-10 14:02:21,583 TRACE [org.jgroups.blocks.cs.NioServer] (NioServer.Selector [/0.0.0.0:57800]-3,keycloak-1-56712) keycloak-1-ip-address:57800: removed connection to keycloak-2-ip-address:57800

On keycloak-2 (the restarted instance), we can see that it becomes a singleton node:

2025-10-10 14:02:53.879 stdout 2025-10-10 14:02:53,878 WARN  [org.jgroups.protocols.pbcast.GMS] (main) keycloak-1-899: too many JOIN attempts (10): becoming singleton
2025-10-10 14:02:53.880 stdout 2025-10-10 14:02:53,880 DEBUG [org.jgroups.protocols.pbcast.NAKACK2] (main) 
2025-10-10 14:02:53.880 stdout [keycloak-1-899 setDigest()]
2025-10-10 14:02:53.880 stdout existing digest:  []
2025-10-10 14:02:53.880 stdout new digest:       keycloak-1-899: [0 (0)]
2025-10-10 14:02:53.880 stdout resulting digest: keycloak-1-899: [0 (0)]

For some unknown reason, keycloak-3 now becomes the new coordinator:

2025-10-10 14:02:53.704 stdout 2025-10-10 14:02:53,704 DEBUG [org.jgroups.protocols.pbcast.GMS] (VERIFY_SUSPECT2.Runner-1) keycloak-3-26562: members are (4) keycloak-1-56712,keycloak-3,keycloak-2-43087,keycloak-1-29717, coord=keycloak-3-26562: I'm the new coordinator

A few seconds later, Infinispan fails to recover the cluster state on keycloak-3:

2025-10-10 14:02:59.782 stdout 2025-10-10 14:02:59,782 WARN  [org.infinispan.topology.ClusterTopologyManagerImpl] (timeout-thread--p4-t1) ISPN000196: Failed to recover cluster state after the current node became the coordinator (or after merge), will retry: org.infinispan.commons.TimeoutException: ISPN000476: Timed out waiting for responses for request 96 from keycloak-1-29717 after 6 seconds
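For anyone triaging similar logs, the three symptoms quoted above (a node becoming a singleton, a new coordinator being elected, and Infinispan failing to recover cluster state) can be flagged with a small scan. This is only a sketch: the regular expressions are derived from the excerpts in this report, and the marker names and the `classify` helper are hypothetical, not part of Keycloak, Infinispan, or JGroups.

```python
import re

# Patterns taken from the log excerpts in this report; the marker names are made up.
MARKERS = {
    "singleton": re.compile(r"too many JOIN attempts \((\d+)\): becoming singleton"),
    "new_coordinator": re.compile(r"coord=(\S+): I'm the new coordinator"),
    "state_recovery_failed": re.compile(r"ISPN000196"),
}


def classify(line):
    """Return the names of all markers found in a single log line."""
    return [name for name, pattern in MARKERS.items() if pattern.search(line)]
```

Running `classify` over each node's log makes it easier to see which of the three symptoms appeared on which instance during a given deployment.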

These are the last logs on the initial coordinator (keycloak-1):

2025-10-10 14:02:11.755 stdout 2025-10-10 14:02:11,755 INFO  [org.infinispan.CLUSTER] () [Context=offlineClientSessions] ISPN100002: Starting rebalance with members [keycloak-1-56712, keycloak-3-26562, keycloak-4-43087], phase READ_OLD_WRITE_ALL, topology id 15
2025-10-10 14:02:11.757 stdout 2025-10-10 14:02:11,757 TRACE [org.jgroups.protocols.pbcast.NAKACK2] () keycloak-1-56712 --> [all]: #177
2025-10-10 14:02:11.757 stdout 2025-10-10 14:02:11,757 TRACE [org.jgroups.protocols.TCP] () keycloak-1-56712: sending msg to null, src=keycloak-1-56712, size=4522, headers are NAKACK2: [MSG, seqno=177], TP: [cluster=ISPN]
2025-10-10 14:02:11.757 stdout 2025-10-10 14:02:11,757 TRACE [org.jgroups.protocols.MFC] () keycloak-1-56712 used 4474 credits, 3391245 remaining
2025-10-10 14:02:21.584 stdout 2025-10-10 14:02:21,583 TRACE [org.jgroups.blocks.cs.NioServer] (NioServer.Selector [/0.0.0.0:57800]-3,keycloak-1-56712) 10.41.1.252:57800: removed connection to keycloak-2-ip-address:57800
2025-10-10 14:02:21.590 stdout 2025-10-10 14:02:21,589 TRACE [org.jgroups.protocols.FD_SOCK2] (NioServer.Selector [/0.0.0.0:57800]-3,keycloak-1-56712) keycloak-1-56712: CONNECT <-- keycloak-4-43087
2025-10-10 14:02:21.590 stdout 2025-10-10 14:02:21,590 TRACE [org.jgroups.protocols.FD_SOCK2] (NioServer.Selector [/0.0.0.0:57800]-3,keycloak-1-56712) keycloak-1-56712: CONNECT-RSP[cluster=ISPN, srv=keycloak-1-56712] --> keycloak-4-43087
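The excerpts above come from different nodes and are not listed in chronological order (the keycloak-1 lines at 14:02:11 precede everything else). A minimal sketch for merging per-node excerpts into one timeline follows. It assumes every line starts with the collector timestamp followed by `stdout`, as in the excerpts, and that the clocks on the VMs are synchronized. The function names are hypothetical.

```python
import re
from datetime import datetime

# Leading collector timestamp, e.g. "2025-10-10 14:02:53.879 stdout <message>"
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) stdout (.*)$")


def parse_line(node, line):
    """Return (timestamp, node, message), or None if the line has no leading timestamp."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    timestamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
    return (timestamp, node, match.group(2))


def merge_timeline(logs_by_node):
    """Merge {node: [log lines]} into one list of events sorted by timestamp."""
    events = []
    for node, lines in logs_by_node.items():
        for line in lines:
            parsed = parse_line(node, line)
            if parsed is not None:
                events.append(parsed)
    events.sort(key=lambda event: event[0])
    return events
```

Sorting on the collector timestamp alone is a deliberate simplification. If the hosts' clocks drift, the ordering across nodes is only approximate.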

End result:
Both keycloak-1 and keycloak-2 become unresponsive.
The JGROUPS_PING table shows an invalid state — for example, keycloak-1 remains listed even though it’s unresponsive.
The state never changes, and the cluster remains stuck in this condition.

Version

26.3.3

Regression

  • The issue is a regression

Expected behavior

The cluster should recover, or should not end up in this state at all, when a cluster member is redeployed.

Actual behavior

Already described under "Describe the bug".

How to Reproduce?

Already described under "Describe the bug": redeploy one or more cluster members while the cluster is running.

Anything else?

No response
