Description
Discussed in #8490
Originally posted by thomasdarimont September 27, 2021
The book “Distributed Systems Observability” by Cindy Sridharan describes logs, distributed tracing, and metrics as
essential telemetry types for monitoring an application in production, also known as the “three pillars of observability”.
Currently Keycloak does not provide metrics out of the box, and users who want metrics need to use extensions like
the aerogear keycloak-metrics-spi or implement their own metrics collection based on the smallrye-metrics support provided by the WildFly and JBoss EAP runtimes.
It would be very helpful for operations teams if Keycloak had a compelling set of useful metrics built-in.
The goal of this discussion is to shape the metrics part of Keycloak’s observability story with focus on Keycloak.X and to compile the foundation for a new metrics design document.
Metrics
Metrics-based monitoring of a Keycloak system could consist of metrics that are relevant for different
audiences, such as operations and SRE teams as well as product teams.
Some of those metrics provide information about different layers of a system, including:
- Process: CPU, Memory consumption, open file descriptors
- JVM: memory, threads, classloading, metadata (java version)
- Datasources: connection pool stats, metadata (database version)
- HTTP server: request count per path / status code, latency distribution
- JGroups: cluster communication stats
- Infinispan: cache stats
- Integrations: outbound HTTP request count / latency distribution -> consider tracing instead to analyze latencies and errors
- Server: metrics collection duration, metadata (keycloak version)
- Keycloak: authentication stats, authorization stats
- Keycloak: inventory stats
Keycloak Metrics
The application layer of a Keycloak system can provide many different metrics that could be arranged in a set of
logical domains. Some of the following metrics might be coarse grained while others could be broken down further
by additional context data, e.g. realm, error_code, client_id, authenticator_execution, or protocol.
The following list serves as an example of high-level metrics that Keycloak could theoretically provide
at some point in time.
The metrics listed below are based on an earlier discussion about a compilation of metrics for Keycloak.
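To illustrate how such context breakdowns could look with Micrometer, the sketch below tags one logical login counter with realm, client and error context. The metric and tag names are illustrative assumptions, not an agreed naming scheme or an existing Keycloak API.

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

// Sketch: one logical metric broken down by context tags such as realm,
// client_id and error_code. Names are illustrative, not a Keycloak API.
public class DimensionalMetricsSketch {

    static Counter logins(MeterRegistry registry, String realm, String clientId, String errorCode) {
        return Counter.builder("keycloak.logins")
                .tag("realm", realm)
                .tag("client_id", clientId)
                .tag("error_code", errorCode) // "none" for successful logins
                .register(registry); // returns the existing counter for a known tag combination
    }

    public static void main(String[] args) {
        MeterRegistry registry = new SimpleMeterRegistry();
        logins(registry, "master", "account-console", "none").increment();
        logins(registry, "master", "account-console", "invalid_credentials").increment();
        // Each distinct tag combination becomes its own time series in the backend.
        System.out.println(registry.getMeters().size());
    }
}
```

Because every distinct tag combination is a separate time series, high-cardinality values (e.g. user IDs) should not be used as tags.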
Model Metrics
Represents the system inventory, and denotes how many items of a particular type exist in the system.
This helps to keep an eye on the growth of the system.
Example metrics:
- #Realms
- #Users per Realm
- #Clients per Realm
- #Groups per Realm
- #Scopes per Realm
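Inventory counts like these map naturally onto gauges, one series per realm. A minimal sketch, assuming Micrometer and a counts map that a background job would refresh from the datastore; the metric name and refresh mechanism are assumptions, not Keycloak APIs:

```java
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: inventory metrics as gauges reading from a shared map that a
// periodic job would refresh with fresh counts from the datastore.
public class ModelMetricsSketch {

    final Map<String, Integer> usersPerRealm = new ConcurrentHashMap<>();

    void registerRealm(MeterRegistry registry, String realm) {
        // The gauge reads the map lazily, so scrapes always see the latest count.
        Gauge.builder("keycloak.users", usersPerRealm, m -> m.getOrDefault(realm, 0))
                .tag("realm", realm)
                .register(registry);
    }

    public static void main(String[] args) {
        MeterRegistry registry = new SimpleMeterRegistry();
        ModelMetricsSketch sketch = new ModelMetricsSketch();
        sketch.registerRealm(registry, "master");
        sketch.usersPerRealm.put("master", 42); // simulated periodic refresh
    }
}
```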
Authentication Metrics
Represents authentication activity for users and clients.
Example metrics:
- #Logins
- #Login Errors
- #Logouts
- #Logout Errors
- #Login duration histogram
- #Client Login
- #Client Login Errors
- #Required Action Executions
- #Required Action Errors
- #Unique AuthenticationFlowSequence Executions (Username -> Password -> 2FA vs. Username -> Password)
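A sketch of how the login counters and the login-duration histogram above could be recorded with Micrometer; metric names are illustrative assumptions:

```java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;
import java.time.Duration;

// Sketch: success/error login counters plus a duration histogram per realm.
public class AuthenticationMetricsSketch {

    static Timer loginTimer(MeterRegistry registry, String realm) {
        return Timer.builder("keycloak.login.duration")
                .tag("realm", realm)
                .publishPercentileHistogram() // enables latency distribution buckets
                .register(registry);
    }

    static void recordLogin(MeterRegistry registry, String realm, boolean success, Duration duration) {
        registry.counter(success ? "keycloak.logins" : "keycloak.login.errors",
                "realm", realm).increment();
        loginTimer(registry, realm).record(duration);
    }

    public static void main(String[] args) {
        MeterRegistry registry = new SimpleMeterRegistry();
        recordLogin(registry, "master", true, Duration.ofMillis(120));
        recordLogin(registry, "master", false, Duration.ofMillis(30));
    }
}
```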
Authorization Metrics
Represents Authorization activity collected for the authz services.
Example metrics:
- #Access Requested
- #Access Granted
- #Access Denied
User Metrics
Represents information about users and their metadata.
Example metrics:
- #Users by realm
- #Users by status (blocked / locked / disabled)
- #Users with missing information (email, phoneNumber, address)
- #Users with unverified information (email, phoneNumber, address)
- #Distribution of credentials
- #Groups by realm
- #Consents by client / type
- #New Users in interval (yesterday, last week, last month, last year)
Client Metrics
Represents information about clients and their metadata.
Example metrics:
- #Clients by realm / protocol / type / enabled / disabled
OIDC Protocol Usage Metrics
Usage information about the OIDC protocol
Example metrics:
- #Token Requests
- #Token Request Errors
- #Refreshes
- #Refresh Errors
- #UserInfo Requests
- #UserInfo Request Errors
- #Token Exchanges
- #Token Exchanges Errors
- Token generation duration distribution by token type (by protocol mapper?)
- UserInfo generation duration distribution (by protocol mapper?)
SAML Protocol Usage Metrics
Usage information about the SAML protocol
Example metrics:
- #AuthnRequests
- #AuthnRequest Errors
- Assertion generation duration distribution (by protocol mapper?)
Federation Metrics
Information about user federation
Example metrics:
- #User lookups in storage
- #User lookup errors in storage
Identity Brokering Metrics
Information about Identity Brokering
Example metrics:
- #Brokered user logins
- #Brokered user login errors
Inbound / Endpoint Metrics
- #Inbound (HTTP) request/response by status / path / protocol
- #Inbound (HTTP) request/response latency distribution
In Micrometer these are usually captured by the dimensional metric http.server.requests{uri=...,status=...,...}.
Outbound Metrics
- #Outbound request/response by status / path / protocol / destination
- #Outbound request/response latency distribution
In Micrometer these are usually captured by the dimensional metric http.client.requests{uri=...,status=...,...}.
Instance Metrics
Represents general information and metadata about the server.
Some of those “metrics” are just simple gauges with a dummy value that expose the actual metadata via labels.
Example metrics:
- Server Version
- Enabled features
- Metrics Collection duration
- #Exceptions by realm / exception class / cause
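The metadata-as-labels pattern could look like this with Micrometer: a gauge with a constant value of 1 whose labels carry the actual information. Metric name and label names are illustrative assumptions:

```java
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

// Sketch: server metadata exposed as a constant-value gauge; the real
// information travels in the labels, not the value.
public class InstanceMetricsSketch {

    static void registerServerInfo(MeterRegistry registry, String version, String features) {
        Gauge.builder("keycloak.server.info", () -> 1.0)
                .tag("version", version)
                .tag("features", features)
                .description("Constant gauge carrying server metadata in its labels")
                .register(registry);
    }

    public static void main(String[] args) {
        MeterRegistry registry = new SimpleMeterRegistry();
        registerServerInfo(registry, "16.0.0", "token-exchange,scripts");
    }
}
```

Queries can then join other series against this one to correlate behavior with a server version, without the version string inflating every other metric.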
Metrics Infrastructure
The WildFly and JBoss EAP based Keycloak / RH SSO distributions use SmallRye Metrics for their runtime metrics collection.
However, the Quarkus team has been recommending Micrometer for custom metrics collection for a while now. To follow this approach, we will focus on Micrometer-based metrics for the new metrics support in Keycloak.X.
OS, Process and JVM based metrics are usually provided by the base metric libraries.
In our case the micrometer library provides a set of useful JVM and system metrics out of the box: https://micrometer.io/docs/ref/jvm
The micrometer Keycloak metrics SPI provides some additional metrics that could be useful.
Metrics instrumentation
Keycloak provides several ways to collect metrics synchronously, e.g.: event listeners, JAX-RS / container specific filters and HTTP client interceptors. Metrics that are more expensive to compute could be collected
asynchronously by a dedicated metrics service that can execute datastore specific queries.
Collected metrics could either be stored directly in the Micrometer meter registry or buffered in a dedicated data structure that periodically flushes the metrics into an underlying registry.
Explicitly computed metrics could be represented as Gauges that are explicitly updated. Counted metrics like the number of logins or failed logins could be recorded via Counters that are updated by
event listeners or request filters / interceptors.
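The event-listener path could be sketched as follows. The EventType enum and UserEvent class below are simplified stand-ins for Keycloak's org.keycloak.events types, used only to keep the example self-contained; they are not the real SPI classes.

```java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

// Sketch of a MetricsEventListener updating counters from user events.
public class MetricsEventListenerSketch {

    enum EventType { LOGIN, LOGIN_ERROR, LOGOUT } // stand-in for org.keycloak.events.EventType

    static class UserEvent { // stand-in for org.keycloak.events.Event
        final EventType type;
        final String realmId;
        UserEvent(EventType type, String realmId) { this.type = type; this.realmId = realmId; }
    }

    private final MeterRegistry registry;

    MetricsEventListenerSketch(MeterRegistry registry) { this.registry = registry; }

    // Called for every user event; one counter series per (type, realm) pair.
    void onEvent(UserEvent event) {
        registry.counter("keycloak.events",
                "type", event.type.name(),
                "realm", event.realmId).increment();
    }

    public static void main(String[] args) {
        MeterRegistry registry = new SimpleMeterRegistry();
        MetricsEventListenerSketch listener = new MetricsEventListenerSketch(registry);
        listener.onEvent(new UserEvent(EventType.LOGIN, "master"));
        listener.onEvent(new UserEvent(EventType.LOGIN_ERROR, "master"));
    }
}
```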
Metrics around HTTP request processing should capture information about the request path, status code
and request duration. Additionally, duration recording should make it possible to track latency profiles.
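A sketch of the timing part with Micrometer's Timer.Sample; the metric name follows Micrometer's http.server.requests convention, while the filter wiring itself is omitted and the uri template is illustrative:

```java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

// Sketch: timing a request and tagging the result with path and status.
public class RequestMetricsSketch {

    static long handleRequest(MeterRegistry registry, String uriTemplate, int status, Runnable handler) {
        Timer.Sample sample = Timer.start(registry);
        handler.run(); // the actual request processing
        return sample.stop(Timer.builder("http.server.requests")
                .tag("uri", uriTemplate) // templated path, not the raw URL, to avoid label explosion
                .tag("status", Integer.toString(status))
                .publishPercentileHistogram() // enables latency profile buckets
                .register(registry)); // returns the measured duration in nanoseconds
    }

    public static void main(String[] args) {
        MeterRegistry registry = new SimpleMeterRegistry();
        handleRequest(registry, "/realms/{realm}/protocol/openid-connect/token", 200, () -> {});
    }
}
```

Tagging with the templated uri rather than the concrete request path keeps the number of time series bounded.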
Keycloak could provide components that enable metrics collection on multiple levels:
- MetricsEventListener: an event listener could update metrics based on user events or admin events
- MetricsFilter: a server request / response filter could update request specific metrics
- MetricsInterceptor: a client request / response filter could update request specific metrics
- MetricsCollector: a metrics service could periodically compute metrics based on registered computation rules
Those metric components should access a shared metric registry, which holds the metadata and state that is eventually exposed by dedicated metric endpoints.
Keycloak Metrics
Initial Metrics Selection
Although many of the metrics mentioned above provide valuable insights about a Keycloak system, we should focus
on a small initial subset of metrics that are provided out of the box.
Built-in Metrics
Some core metrics should be built into Keycloak and provide configuration options, like whether the metric is collected at all, or its granularity, e.g. additional tags / labels to add.
Custom Metrics
Some of the metrics mentioned above could be provided out of the box by Keycloak; however, there will be use cases that cannot be foreseen, which require the ability to contribute custom metrics to the system.
For this Keycloak needs to provide a metrics SPI that enables users to add their own custom metrics.
Metrics Configuration
We should have a way to let users control which metrics are collected / tracked by Keycloak.
Users should be able to control things like:
- Which metrics to enable?
- Which tags to emit alongside the metric?
- Which context information to include? (e.g. request parts or selected parameters)
Exposing Metrics
Metrics need to be accessible to metrics collection tools like Prometheus or InfluxDB, which usually fetch metrics from an HTTP endpoint. We could either provide one global metrics endpoint for the whole server and all realms, or realm-specific endpoints that can be consumed by the collectors. The global model is supported by Quarkus out of the box via the /q/metrics endpoint, which could then contain information about the process, JVM and instance, as well as all the Keycloak application metrics.
However, in environments where a Keycloak system is shared among multiple parties, e.g. a collection-of-realms-per-tenant model, users might only be allowed to access a subset of the metrics via realm-specific endpoints that provide only the metrics for a particular realm. In this case an endpoint like /auth/realms/$myrealm/metrics
could serve as a realm-specific endpoint that provides only the Keycloak application metrics for that realm and perhaps a small subset of server metadata.
Note that it should be possible to protect the endpoints which expose realm metrics.
Metrics SPI
A metrics SPI should allow users to contribute new metrics to the Keycloak metrics collection.
The registered metrics could hook into the metrics collection infrastructure described above.
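One hypothetical shape for such an SPI: each provider receives the shared registry and contributes its meters. The interface name and method below are assumptions for illustration; no such SPI exists in Keycloak today.

```java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

// Sketch of a possible metrics SPI contract (hypothetical, for illustration).
public class MetricsSpiSketch {

    interface MetricsProvider {
        // Invoked once at startup with the server's shared meter registry.
        void register(MeterRegistry registry);
    }

    public static void main(String[] args) {
        MeterRegistry registry = new SimpleMeterRegistry();
        // A custom provider contributed by a deployment could look like this:
        MetricsProvider custom = r -> r.counter("custom.business.metric", "realm", "master");
        custom.register(registry);
    }
}
```

Because providers only see the registry interface, custom metrics would automatically flow through the same exposition endpoints and configuration as the built-in ones.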
Questions
- Which metrics are important for you? What other metrics would you like to see?
- What questions do you want to solve based on metrics and which metrics would support you here?