Concurrent starts with JDBC_PING lead to a split cluster #41290

@ahus1

Description

Before reporting an issue

  • I have read and understood the above terms for submitting issues, and I understand that my issue may be closed without action if I do not follow them.

Area

infinispan

Describe the bug

When starting two nodes concurrently, there can be two coordinators for an initial period of time.

Version

23.x

Regression

  • The issue is a regression

Expected behavior

There should be only one coordinator at a time

Actual behavior

There are two coordinators

How to Reproduce?

As reported by @thomasdarimont

In environments where customers deploy Keycloak via AWS Fargate, I sporadically see situations where two Keycloak instances in a cluster register themselves as an Infinispan coordinator in the jgroups_ping table.
Has anyone here seen this or similar situations?
The following examples are from a local docker compose example.

...
kc1-1  | 2025-07-18 19:20:34,313 INFO  [org.jgroups.protocols.pbcast.GMS] (main) kc1-35658: no members discovered after 2 ms: creating cluster as coordinator
kc2-1  | 2025-07-18 19:20:34,314 INFO  [org.jgroups.JChannel] (main) local_addr: 6fcea7e2-3e50-4239-88a2-e3a529cdd27a, name: kc2-31834
kc2-1  | 2025-07-18 19:20:34,321 INFO  [org.jgroups.protocols.FD_SOCK2] (main) server listening on *:57800
kc2-1  | 2025-07-18 19:20:34,324 INFO  [org.jgroups.protocols.pbcast.GMS] (main) kc2-31834: no members discovered after 2 ms: creating cluster as coordinator
kc1-1  | 2025-07-18 19:20:34,325 INFO  [org.infinispan.CLUSTER] (main) ISPN000094: Received new cluster view for channel ISPN: [kc1-35658|0] (1) [kc1-35658]
kc1-1  | 2025-07-18 19:20:34,327 INFO  [org.keycloak.jgroups.certificates.CertificateReloadManager] (main) Reloading JGroups Certificate
kc2-1  | 2025-07-18 19:20:34,336 INFO  [org.infinispan.CLUSTER] (main) ISPN000094: Received new cluster view for channel ISPN: [kc2-31834|0] (1) [kc2-31834]
...
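
A minimal sketch for spotting this pattern in collected logs, assuming the log line format shown above. The helper name `self_elected_coordinators` is hypothetical and not part of Keycloak or JGroups; it only scans for the GMS message a node logs when it creates a cluster on its own.

```python
import re

# Matches the GMS message logged when a node finds no peers and forms its own cluster.
COORD_RE = re.compile(
    r"\[org\.jgroups\.protocols\.pbcast\.GMS\] \(main\) (?P<node>\S+): "
    r"no members discovered after \d+ ms: creating cluster as coordinator"
)


def self_elected_coordinators(log_lines):
    """Return the node names that created a cluster as coordinator, in log order."""
    nodes = []
    for line in log_lines:
        match = COORD_RE.search(line)
        if match and match.group("node") not in nodes:
            nodes.append(match.group("node"))
    return nodes


# Lines copied from the excerpt above.
sample = [
    "kc1-1  | 2025-07-18 19:20:34,313 INFO  [org.jgroups.protocols.pbcast.GMS] (main) kc1-35658: no members discovered after 2 ms: creating cluster as coordinator",
    "kc2-1  | 2025-07-18 19:20:34,314 INFO  [org.jgroups.JChannel] (main) local_addr: 6fcea7e2-3e50-4239-88a2-e3a529cdd27a, name: kc2-31834",
    "kc2-1  | 2025-07-18 19:20:34,324 INFO  [org.jgroups.protocols.pbcast.GMS] (main) kc2-31834: no members discovered after 2 ms: creating cluster as coordinator",
]

coordinators = self_elected_coordinators(sample)
if len(coordinators) > 1:
    print("more than one node created a cluster:", coordinators)
```

More than one name in the result means several nodes each started their own single-member cluster, which is the split described here.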

This situation heals itself after a few seconds / (sometimes) minutes:

kc1-1  | 2025-07-18 19:21:22,368 INFO  [org.infinispan.CLUSTER] () ISPN000093: Received new, MERGED cluster view for channel ISPN: MergeView::[kc1-35658|1] (2) [kc1-35658, kc2-31834], 2 subgroups: [kc1-35658|0] (1) [kc1-35658], [kc2-31834|0] (1) [kc2-31834]
kc1-1  | 2025-07-18 19:21:22,368 INFO  [org.keycloak.jgroups.certificates.CertificateReloadManager] () Reloading JGroups Certificate
kc1-1  | 2025-07-18 19:21:22,372 INFO  [org.infinispan.CLUSTER] () ISPN100000: Node kc2-31834 joined the cluster
kc1-1  | 2025-07-18 19:21:22,372 INFO  [org.infinispan.CLUSTER] () ISPN100000: Node kc2-31834 joined the cluster
kc2-1  | 2025-07-18 19:21:23,380 INFO  [org.infinispan.CLUSTER] () ISPN000093: Received new, MERGED cluster view for channel ISPN: MergeView::[kc1-35658|1] (2) [kc1-35658, kc2-31834], 2 subgroups: [kc1-35658|0] (1) [kc1-35658], [kc2-31834|0] (1) [kc2-31834]
kc2-1  | 2025-07-18 19:21:23,380 INFO  [org.keycloak.jgroups.certificates.CertificateReloadManager] () Reloading JGroups Certificate
kc2-1  | 2025-07-18 19:21:23,385 INFO  [org.infinispan.CLUSTER] () ISPN100000: Node kc1-35658 joined the cluster
kc2-1  | 2025-07-18 19:21:23,386 INFO  [org.infinispan.CLUSTER] () ISPN100000: Node kc1-35658 joined the cluster
...
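
To put a number on "a few seconds / (sometimes) minutes" for this run, the timestamps of the first single-member view on kc1 and of the merged view on kc1 can be subtracted. This is only a sketch that assumes the timestamp format of the log lines above.

```python
from datetime import datetime

# Timestamp layout used in the log excerpts: date, time, comma, milliseconds.
TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def parse_ts(text):
    """Parse a log timestamp such as '2025-07-18 19:20:34,325'."""
    return datetime.strptime(text, TS_FORMAT)


# ISPN000094 on kc1: first view containing only kc1.
first_single_view = parse_ts("2025-07-18 19:20:34,325")
# ISPN000093 on kc1: merged view containing kc1 and kc2.
merged_view = parse_ts("2025-07-18 19:21:22,368")

split_duration = merged_view - first_single_view
print(f"cluster was split for {split_duration.total_seconds():.3f} s")
```

For the excerpt above, the split lasted roughly 48 seconds, which is long enough for scheduled tasks to start on both nodes.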

The problem is that this leads to all sorts of issues. Since both Keycloak instances are coordinators, some maintenance tasks that should run on only one instance in the cluster are executed by multiple instances.
See: org.keycloak.storage.datastore.DefaultDatastoreProviderFactory#scheduleTask
One problem that I observed was that cleanup tasks lock some rows in the offline_user_session / offline_client_session tables if they contain a high number of records. User requests that managed to acquire a connection and want to access an "uncached" offline / online user session are then effectively blocked by these locks. If too many requests are blocked, they slowly start exhausting the database connection pool. After that, Keycloak stops accepting new requests and becomes unresponsive. Next, the other Keycloak instances in the cluster notice this and receive timeouts during the Infinispan synchronization. This can eventually lead to a complete cluster failure and downtime if there is too much traffic on the system.
I wonder if it would make sense to add some additional checks to org.keycloak.jgroups.protocol.KEYCLOAK_JDBC_PING2 to detect that there is already another coordinator and restart the infinispan initialization to avoid those situations.
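
The kind of check suggested above could be sketched as a query against the ping table. This is not how KEYCLOAK_JDBC_PING2 is implemented. The column names (`address`, `name`, `cluster_name`, `ip`, `coord`) are an assumption about the jgroups_ping schema, the kc1 address and both IP values are made up, and an in-memory SQLite database stands in for PostgreSQL so the example is self-contained.

```python
import sqlite3

# In-memory stand-in for the Keycloak database; schema is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jgroups_ping ("
    " address TEXT PRIMARY KEY,"
    " name TEXT,"
    " cluster_name TEXT,"
    " ip TEXT,"
    " coord BOOLEAN)"
)

# Two nodes that both registered as coordinator, as in the split described above.
# The kc1 address and both IPs are placeholder values.
conn.executemany(
    "INSERT INTO jgroups_ping VALUES (?, ?, ?, ?, ?)",
    [
        ("addr-kc1-placeholder", "kc1-35658", "ISPN", "10.0.0.1:7800", True),
        ("6fcea7e2-3e50-4239-88a2-e3a529cdd27a", "kc2-31834", "ISPN", "10.0.0.2:7800", True),
    ],
)


def registered_coordinators(connection, cluster_name):
    """Return the names of all nodes flagged as coordinator for one cluster."""
    cursor = connection.execute(
        "SELECT name FROM jgroups_ping"
        " WHERE cluster_name = ? AND coord = 1"
        " ORDER BY name",
        (cluster_name,),
    )
    return [row[0] for row in cursor.fetchall()]


names = registered_coordinators(conn, "ISPN")
if len(names) > 1:
    print("split suspected, coordinators:", names)
```

A node that finds another coordinator already registered for its cluster could then delay or retry its own initialization instead of continuing as a second coordinator. How that would fit into the discovery protocol is left open here.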

Anything else?

Local test:

volumes:
  postgres_data:
    driver: local

services:
  postgres:
    image: postgres:15
    volumes:
      - postgres_data:/var/lib/postgresql/data
    environment:
      POSTGRES_DB: keycloak
      POSTGRES_USER: keycloak
      POSTGRES_PASSWORD: password
    ports:
      - 5433:5432

  kc1:
    # image: quay.io/keycloak/keycloak:26.3.1
    image: customkeycloak
    #    build:
    #      context: .
    hostname: kc1
    environment:
#      DEBUG: "true"
#      DEBUG_PORT: "*:8787"
      KC_BOOTSTRAP_ADMIN_USERNAME: admin
      KC_BOOTSTRAP_ADMIN_PASSWORD: admin
      KC_DB_SCHEMA: public
      KC_DB_USERNAME: keycloak
      KC_DB_PASSWORD: password
      KC_DB_URL: jdbc:postgresql://postgres/keycloak
      KC_HOSTNAME_STRICT: "false"
      KC_HTTP_ENABLED: "true"
      #KC_LOG_LEVEL: INFO,org.infinispan:DEBUG,org.jgroups:DEBUG
      KC_LOG_LEVEL: INFO,org.keycloak.services.scheduled:DEBUG,org.keycloak.storage.datastore.DefaultDatastoreProviderFactory:DEBUG

    ports:
      - 8080:8080
      - 8787:8787
      - 8443:8443

#    volumes:
#      - ./keycloak-benchmark-dataset-999.0.0-SNAPSHOT.jar:/opt/keycloak/providers/keycloak-benchmark-dataset.jar:z

    depends_on:
      - postgres

  kc2:
    image: customkeycloak
    #    build:
    #      context: .
    hostname: kc2
    environment:
      KC_BOOTSTRAP_ADMIN_USERNAME: admin
      KC_BOOTSTRAP_ADMIN_PASSWORD: admin
      KC_DB_SCHEMA: public
      KC_DB_USERNAME: keycloak
      KC_DB_PASSWORD: password
      KC_DB_URL: jdbc:postgresql://postgres/keycloak
      KC_HOSTNAME_STRICT: "false"
      KC_HTTP_ENABLED: "true"
      KC_LOG_LEVEL: INFO,org.keycloak.services.scheduled:DEBUG,org.keycloak.storage.datastore.DefaultDatastoreProviderFactory:DEBUG
      #KC_LOG_LEVEL: INFO,org.infinispan:DEBUG,org.jgroups:DEBUG

    ports:
      - 18080:8080
      - 18443:8443

    depends_on:
      - postgres

  kc3:
    image: customkeycloak
    #    build:
    #      context: .
    hostname: kc3
    environment:
      KC_BOOTSTRAP_ADMIN_USERNAME: admin
      KC_BOOTSTRAP_ADMIN_PASSWORD: admin
      KC_DB_SCHEMA: public
      KC_DB_USERNAME: keycloak
      KC_DB_PASSWORD: password
      KC_DB_URL: jdbc:postgresql://postgres/keycloak
      KC_HOSTNAME_STRICT: "false"
      KC_HTTP_ENABLED: "true"
      KC_LOG_LEVEL: INFO,org.keycloak.services.scheduled:DEBUG,org.keycloak.storage.datastore.DefaultDatastoreProviderFactory:DEBUG
      #KC_LOG_LEVEL: INFO,org.infinispan:DEBUG,org.jgroups:DEBUG

    ports:
      - 28080:8080
      - 28443:8443

    depends_on:
      - postgres


  kc4:
    image: customkeycloak
    #    build:
    #      context: .
    hostname: kc4
    environment:
      KC_BOOTSTRAP_ADMIN_USERNAME: admin
      KC_BOOTSTRAP_ADMIN_PASSWORD: admin
      KC_DB_SCHEMA: public
      KC_DB_USERNAME: keycloak
      KC_DB_PASSWORD: password
      KC_DB_URL: jdbc:postgresql://postgres/keycloak
      KC_HOSTNAME_STRICT: "false"
      KC_HTTP_ENABLED: "true"
      KC_LOG_LEVEL: INFO,org.keycloak.services.scheduled:DEBUG,org.keycloak.storage.datastore.DefaultDatastoreProviderFactory:DEBUG
      #KC_LOG_LEVEL: INFO,org.infinispan:DEBUG,org.jgroups:DEBUG

    ports:
      - 38080:8080
      - 38443:8443

    depends_on:
      - postgres

Dockerfile:

ARG KC_VERSION=26.3.1
FROM quay.io/keycloak/keycloak:$KC_VERSION AS builder

ENV KC_METRICS_ENABLED=true
ENV KC_HEALTH_ENABLED=true
ENV KC_FEATURES=preview
ENV KC_DB=postgres
ENV KC_HTTP_RELATIVE_PATH=/auth

RUN /opt/keycloak/bin/kc.sh build

FROM quay.io/keycloak/keycloak:$KC_VERSION
COPY --from=builder /opt/keycloak/lib/quarkus/ /opt/keycloak/lib/quarkus/
COPY --from=builder /opt/keycloak/conf/ /opt/keycloak/conf/
WORKDIR /opt/keycloak

ENV KC_METRICS_ENABLED=true
ENV KC_HEALTH_ENABLED=true
ENV KC_FEATURES=preview
ENV KC_DB=postgres
ENV KC_HTTP_RELATIVE_PATH=/auth

# for demonstration purposes only, please make sure to use proper certificates in production instead
RUN keytool -genkeypair -storepass password -storetype PKCS12 -keyalg RSA -keysize 2048 -dname "CN=server" -alias server -ext "SAN:c=DNS:localhost,IP:127.0.0.1" -keystore conf/server.keystore

ENTRYPOINT ["/opt/keycloak/bin/kc.sh", "start"]

Build and start:

docker build -t customkeycloak .

docker compose up postgres
docker compose up kc1 kc2
docker compose up kc3 kc4
