Description
Before reporting an issue
- I have read and understood the above terms for submitting issues, and I understand that my issue may be closed without action if I do not follow them.
Area
infinispan
Describe the bug
When starting two nodes concurrently, there can be two coordinators for an initial period of time.
Version
26.3.1
Regression
- The issue is a regression
Expected behavior
There should be only one coordinator at a time
Actual behavior
There are two coordinators
How to Reproduce?
As reported by @thomasdarimont:
In environments where customers deploy Keycloak via AWS Fargate, I sporadically see situations where two Keycloak instances in a cluster register themselves as an Infinispan coordinator in the jgroups_ping table.
Has anyone here seen this or similar situations?
The following examples are from a local docker compose example.
...
kc1-1 | 2025-07-18 19:20:34,313 INFO [org.jgroups.protocols.pbcast.GMS] (main) kc1-35658: no members discovered after 2 ms: creating cluster as coordinator
kc2-1 | 2025-07-18 19:20:34,314 INFO [org.jgroups.JChannel] (main) local_addr: 6fcea7e2-3e50-4239-88a2-e3a529cdd27a, name: kc2-31834
kc2-1 | 2025-07-18 19:20:34,321 INFO [org.jgroups.protocols.FD_SOCK2] (main) server listening on *:57800
kc2-1 | 2025-07-18 19:20:34,324 INFO [org.jgroups.protocols.pbcast.GMS] (main) kc2-31834: no members discovered after 2 ms: creating cluster as coordinator
kc1-1 | 2025-07-18 19:20:34,325 INFO [org.infinispan.CLUSTER] (main) ISPN000094: Received new cluster view for channel ISPN: [kc1-35658|0] (1) [kc1-35658]
kc1-1 | 2025-07-18 19:20:34,327 INFO [org.keycloak.jgroups.certificates.CertificateReloadManager] (main) Reloading JGroups Certificate
kc2-1 | 2025-07-18 19:20:34,336 INFO [org.infinispan.CLUSTER] (main) ISPN000094: Received new cluster view for channel ISPN: [kc2-31834|0] (1) [kc2-31834]
...
This situation heals itself after a few seconds or (sometimes) minutes:
kc1-1 | 2025-07-18 19:21:22,368 INFO [org.infinispan.CLUSTER] () ISPN000093: Received new, MERGED cluster view for channel ISPN: MergeView::[kc1-35658|1] (2) [kc1-35658, kc2-31834], 2 subgroups: [kc1-35658|0] (1) [kc1-35658], [kc2-31834|0] (1) [kc2-31834]
kc1-1 | 2025-07-18 19:21:22,368 INFO [org.keycloak.jgroups.certificates.CertificateReloadManager] () Reloading JGroups Certificate
kc1-1 | 2025-07-18 19:21:22,372 INFO [org.infinispan.CLUSTER] () ISPN100000: Node kc2-31834 joined the cluster
kc1-1 | 2025-07-18 19:21:22,372 INFO [org.infinispan.CLUSTER] () ISPN100000: Node kc2-31834 joined the cluster
kc2-1 | 2025-07-18 19:21:23,380 INFO [org.infinispan.CLUSTER] () ISPN000093: Received new, MERGED cluster view for channel ISPN: MergeView::[kc1-35658|1] (2) [kc1-35658, kc2-31834], 2 subgroups: [kc1-35658|0] (1) [kc1-35658], [kc2-31834|0] (1) [kc2-31834]
kc2-1 | 2025-07-18 19:21:23,380 INFO [org.keycloak.jgroups.certificates.CertificateReloadManager] () Reloading JGroups Certificate
kc2-1 | 2025-07-18 19:21:23,385 INFO [org.infinispan.CLUSTER] () ISPN100000: Node kc1-35658 joined the cluster
kc2-1 | 2025-07-18 19:21:23,386 INFO [org.infinispan.CLUSTER] () ISPN100000: Node kc1-35658 joined the cluster
...
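The delayed self-healing matches how the JGroups MERGE3 protocol works: it only probes for diverged subgroups periodically, so two single-node clusters can coexist until the next merge check fires. As a rough illustration (assuming JGroups 5.x; the setters mirror MERGE3's min_interval/max_interval properties, and the values below are my own, not a recommendation):

import org.jgroups.protocols.MERGE3;

public class MergeTuning {
    // Sketch only: MERGE3 exchanges membership info every min_interval..max_interval
    // milliseconds (defaults on the order of tens of seconds), which is why the two
    // single-node clusters above can coexist for seconds or even minutes before the
    // MergeView is installed.
    static MERGE3 tunedMerge() {
        return new MERGE3()
                .setMinInterval(1_000)  // assumption: probe roughly every second
                .setMaxInterval(5_000); // assumption: upper bound of the probe period
    }
}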
The problem is that this leads to all sorts of issues. Since both Keycloak instances are coordinators, maintenance tasks which should run on only one instance in the cluster are executed by multiple instances.
See: org.keycloak.storage.datastore.DefaultDatastoreProviderFactory#scheduleTask
One problem I observed is that cleanup tasks lock rows in the offline_user_session / offline_client_session tables when they contain a high number of records. User requests that managed to acquire a connection and want to access an "uncached" offline / online user session are then effectively blocked by those locks. If too many requests are blocked, they slowly exhaust the database connection pool; after that, Keycloak stops accepting new requests and becomes unresponsive. The next thing that happens is that the other Keycloak instances in the cluster notice this and receive timeouts during the Infinispan synchronization. If there is too much traffic on the system, this can eventually lead to a complete cluster failure and downtime.
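To illustrate why both nodes run those tasks: below is a minimal sketch of the usual JGroups "am I the coordinator?" check (not Keycloak's actual scheduleTask gating). During the split, each node's view contains only itself, so both nodes answer yes and both execute the singleton maintenance tasks.

import org.jgroups.JChannel;
import org.jgroups.View;

public class CoordinatorCheck {
    // Minimal sketch of a typical coordinator check: the coordinator is the first
    // member of the current view. During the split above, kc1 sees the view
    // [kc1-35658] and kc2 sees [kc2-31834], so this check returns true on BOTH nodes.
    static boolean isCoordinator(JChannel channel) {
        View view = channel.getView();
        return view != null && channel.getAddress().equals(view.getCoord());
    }
}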
I wonder if it would make sense to add some additional checks to org.keycloak.jgroups.protocol.KEYCLOAK_JDBC_PING2 to detect that there is already another coordinator and to restart the Infinispan initialization to avoid those situations.
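For illustration only, such a check could look roughly like this. The table and column names (JGROUPS_PING, cluster_name, coord, address) are assumptions derived from the jgroups_ping table mentioned above; this is not the existing KEYCLOAK_JDBC_PING2 API.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class CoordinatorProbe {
    // Hypothetical sketch: before declaring itself coordinator, a node could count
    // rows already flagged as coordinator for the same cluster and, if one exists,
    // retry discovery instead of founding a second single-node cluster.
    static boolean otherCoordinatorExists(Connection con, String cluster, String self)
            throws SQLException {
        String sql = "SELECT COUNT(*) FROM JGROUPS_PING"
                   + " WHERE cluster_name = ? AND coord = TRUE AND address <> ?";
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setString(1, cluster);
            ps.setString(2, self);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() && rs.getLong(1) > 0;
            }
        }
    }
}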
Anything else?
Local test:
volumes:
  postgres_data:
    driver: local
services:
  postgres:
    image: postgres:15
    volumes:
      - postgres_data:/var/lib/postgresql/data
    environment:
      POSTGRES_DB: keycloak
      POSTGRES_USER: keycloak
      POSTGRES_PASSWORD: password
    ports:
      - 5433:5432
  kc1:
    # image: quay.io/keycloak/keycloak:26.3.1
    image: customkeycloak
    # build:
    #   context: .
    hostname: kc1
    environment:
      # DEBUG: "true"
      # DEBUG_PORT: "*:8787"
      KC_BOOTSTRAP_ADMIN_USERNAME: admin
      KC_BOOTSTRAP_ADMIN_PASSWORD: admin
      KC_DB_SCHEMA: public
      KC_DB_USERNAME: keycloak
      KC_DB_PASSWORD: password
      KC_DB_URL: jdbc:postgresql://postgres/keycloak
      KC_HOSTNAME_STRICT: "false"
      KC_HTTP_ENABLED: "true"
      # KC_LOG_LEVEL: INFO,org.infinispan:DEBUG,org.jgroups:DEBUG
      KC_LOG_LEVEL: INFO,org.keycloak.services.scheduled:DEBUG,org.keycloak.storage.datastore.DefaultDatastoreProviderFactory:DEBUG
    ports:
      - 8080:8080
      - 8787:8787
      - 8443:8443
    # volumes:
    #   - ./keycloak-benchmark-dataset-999.0.0-SNAPSHOT.jar:/opt/keycloak/providers/keycloak-benchmark-dataset.jar:z
    depends_on:
      - postgres
  kc2:
    image: customkeycloak
    # build:
    #   context: .
    hostname: kc2
    environment:
      KC_BOOTSTRAP_ADMIN_USERNAME: admin
      KC_BOOTSTRAP_ADMIN_PASSWORD: admin
      KC_DB_SCHEMA: public
      KC_DB_USERNAME: keycloak
      KC_DB_PASSWORD: password
      KC_DB_URL: jdbc:postgresql://postgres/keycloak
      KC_HOSTNAME_STRICT: "false"
      KC_HTTP_ENABLED: "true"
      KC_LOG_LEVEL: INFO,org.keycloak.services.scheduled:DEBUG,org.keycloak.storage.datastore.DefaultDatastoreProviderFactory:DEBUG
      # KC_LOG_LEVEL: INFO,org.infinispan:DEBUG,org.jgroups:DEBUG
    ports:
      - 18080:8080
      - 18443:8443
    depends_on:
      - postgres
  kc3:
    image: customkeycloak
    # build:
    #   context: .
    hostname: kc3
    environment:
      KC_BOOTSTRAP_ADMIN_USERNAME: admin
      KC_BOOTSTRAP_ADMIN_PASSWORD: admin
      KC_DB_SCHEMA: public
      KC_DB_USERNAME: keycloak
      KC_DB_PASSWORD: password
      KC_DB_URL: jdbc:postgresql://postgres/keycloak
      KC_HOSTNAME_STRICT: "false"
      KC_HTTP_ENABLED: "true"
      KC_LOG_LEVEL: INFO,org.keycloak.services.scheduled:DEBUG,org.keycloak.storage.datastore.DefaultDatastoreProviderFactory:DEBUG
      # KC_LOG_LEVEL: INFO,org.infinispan:DEBUG,org.jgroups:DEBUG
    ports:
      - 28080:8080
      - 28443:8443
    depends_on:
      - postgres
  kc4:
    image: customkeycloak
    # build:
    #   context: .
    hostname: kc4
    environment:
      KC_BOOTSTRAP_ADMIN_USERNAME: admin
      KC_BOOTSTRAP_ADMIN_PASSWORD: admin
      KC_DB_SCHEMA: public
      KC_DB_USERNAME: keycloak
      KC_DB_PASSWORD: password
      KC_DB_URL: jdbc:postgresql://postgres/keycloak
      KC_HOSTNAME_STRICT: "false"
      KC_HTTP_ENABLED: "true"
      KC_LOG_LEVEL: INFO,org.keycloak.services.scheduled:DEBUG,org.keycloak.storage.datastore.DefaultDatastoreProviderFactory:DEBUG
      # KC_LOG_LEVEL: INFO,org.infinispan:DEBUG,org.jgroups:DEBUG
    ports:
      - 38080:8080
      - 38443:8443
    depends_on:
      - postgres
Dockerfile:
ARG KC_VERSION=26.3.1
FROM quay.io/keycloak/keycloak:$KC_VERSION AS builder
ENV KC_METRICS_ENABLED=true
ENV KC_HEALTH_ENABLED=true
ENV KC_FEATURES=preview
ENV KC_DB=postgres
ENV KC_HTTP_RELATIVE_PATH=/auth
RUN /opt/keycloak/bin/kc.sh build
FROM quay.io/keycloak/keycloak:$KC_VERSION
COPY --from=builder /opt/keycloak/lib/quarkus/ /opt/keycloak/lib/quarkus/
COPY --from=builder /opt/keycloak/conf/ /opt/keycloak/conf/
WORKDIR /opt/keycloak
ENV KC_METRICS_ENABLED=true
ENV KC_HEALTH_ENABLED=true
ENV KC_FEATURES=preview
ENV KC_DB=postgres
ENV KC_HTTP_RELATIVE_PATH=/auth
# for demonstration purposes only, please make sure to use proper certificates in production instead
RUN keytool -genkeypair -storepass password -storetype PKCS12 -keyalg RSA -keysize 2048 -dname "CN=server" -alias server -ext "SAN:c=DNS:localhost,IP:127.0.0.1" -keystore conf/server.keystore
ENTRYPOINT ["/opt/keycloak/bin/kc.sh", "start"]
Build and run:
docker build -t customkeycloak .
docker compose up postgres
docker compose up kc1 kc2
docker compose up kc3 kc4
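While kc1 and kc2 are starting, the split is visible in the ping table (column names are assumed from the schema KEYCLOAK_JDBC_PING2 creates); during the window described above, two rows show coord = t:
docker compose exec postgres psql -U keycloak -d keycloak -c "SELECT name, coord FROM jgroups_ping;"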