这是indexloc提供的服务,不要输入任何密码
Skip to content

MDEV-36009: Systemd: Restart on OOM #4182

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: 10.6
Choose a base branch
from

Conversation

fredericve
Copy link

Per systemd/systemd#36529 OOM counts as a on-abnormal condition. To ensure that MariaDB testart on OOM the Restart is changes to on-abnormal which an extension on the current on-abort condition.

  • The Jira issue number for this PR is: MDEV-36009

Description

The previous PR was only ported to up to 10.11. This PR ports it to 10.6 which is in ubuntu jammy.

Per systemd/systemd#36529 OOM counts
as a on-abnormal condition. To ensure that MariaDB testart on
OOM the Restart is changes to on-abnormal which an extension
on the current on-abort condition.
@grooverdan
Copy link
Member

other PR #3611.

Ok by me, but @vuvova is the 10.6 gatekeeper.

@grooverdan grooverdan added the External Contribution All PRs from entities outside of MariaDB Foundation, Corporation, Codership agreements. label Jul 10, 2025
@vuvova
Copy link
Member

vuvova commented Jul 14, 2025

what does "timeout" mean in the MariaDB case?
@grooverdan ?

@grooverdan
Copy link
Member

Timeout is when the start or stop time exceeds the 900 second(15min) + any extend timeout (usually 60 seconds) messages sent. So long crash recovery should be normally ok, because the extend timeout sends periodic messages, however extreme stalled hardware or CPU could cause a mismatch between mariadb advised timeout extensions and the ability to achieve progress in that time.

Galera joiner/donor SST threads also operate on extending timeout (though I'm a bit concerned on their implementation look at it - too close to limit).

Surprising me here also is that stop timeouts are included in when to restart. Shutdown timeout is extended with innodb buffer pool dump every 1k pages, shutdown innodb flushing, innodb redo log writing during shutdown, waiting for innodb transactions to finish, encryption thread termination, page cleaner thread termination, change buffer merge. With the number of extensions here I'm doubting a shutdown timeout could occur.

@vuvova
Copy link
Member

vuvova commented Jul 15, 2025

May be, let's increase the timeout to be more certain it won't happen? Restarting on startup timeout sounds like a sure way to create an infinite restart loop

@grooverdan
Copy link
Member

I rechecked galera SST - its all good there.

Summary:

There's no real gain in extending the start or stop timeout. InnoDB recovery is already extending the timeouts predictably.

Its more important that this extension is covered correctly than increasing the 15minute start.

Long version:

This is what a startup look like with a large innodb crash recovery where the start/stop timeout was reduced to 40 seconds:

Jul 17 17:36:45 bark mariadbd[136468]: 2025-07-17 17:36:45 0 [Note] Starting MariaDB 10.11.11-MariaDB source revision e69f8cae1a15e15b9e4f5e0f8497e1f17bdc81a4 server_uid enNnnQxEDTHTmtOjZVL8QXFFEN0= as process 136468
Jul 17 17:36:45 bark mariadbd[136468]: 2025-07-17 17:36:45 0 [Warning] mariadbd: Couldn't allocate 130023424 bytes (Large/HugeTLB memory page size 2097152); errno 12; continuing to smaller size
Jul 17 17:36:45 bark mariadbd[136468]: 2025-07-17 17:36:45 0 [Note] InnoDB: Compressed tables use zlib 1.3.1.zlib-ng
Jul 17 17:36:45 bark mariadbd[136468]: 2025-07-17 17:36:45 0 [Note] InnoDB: Number of transaction pools: 1
Jul 17 17:36:45 bark mariadbd[136468]: 2025-07-17 17:36:45 0 [Note] InnoDB: Using AVX512 instructions
Jul 17 17:36:45 bark mariadbd[136468]: 2025-07-17 17:36:45 0 [Note] InnoDB: Using Linux native AIO
Jul 17 17:36:45 bark mariadbd[136468]: 2025-07-17 17:36:45 0 [Note] InnoDB: Initializing buffer pool, total size = 10.000GiB, chunk size = 160.000MiB
Jul 17 17:36:45 bark mariadbd[136468]: 2025-07-17 17:36:45 0 [Note] InnoDB: Initialized memory pressure event listener
Jul 17 17:36:45 bark mariadbd[136468]: 2025-07-17 17:36:45 0 [Note] InnoDB: Completed initialization of buffer pool
Jul 17 17:36:45 bark mariadbd[136468]: 2025-07-17 17:36:45 0 [Note] InnoDB: File system buffers for log disabled (block size=512 bytes)
Jul 17 17:36:45 bark mariadbd[136468]: 2025-07-17 17:36:45 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=245008712
Jul 17 17:37:00 bark mariadbd[136468]: 2025-07-17 17:37:00 0 [Note] InnoDB: Parsed redo log up to LSN=2373580178; to recover: 106361 pages
Jul 17 17:37:00 bark systemd[1]: mariadb.service: Got notification message from PID 136468: STATUS=Parsed redo log up to LSN=2373580178; to recover: 106361 pages, EXTEND_TIMEOUT_USEC=30000000
Jul 17 17:37:15 bark mariadbd[136468]: 2025-07-17 17:37:15 0 [Note] InnoDB: Parsed redo log up to LSN=4124633710; to recover: 193099 pages
Jul 17 17:37:15 bark systemd[1]: mariadb.service: Got notification message from PID 136468: STATUS=Parsed redo log up to LSN=4124633710; to recover: 193099 pages, EXTEND_TIMEOUT_USEC=30000000
Jul 17 17:37:30 bark mariadbd[136468]: 2025-07-17 17:37:30 0 [Note] InnoDB: Parsed redo log up to LSN=5710046515; to recover: 271631 pages
Jul 17 17:37:30 bark systemd[1]: mariadb.service: Got notification message from PID 136468: STATUS=Parsed redo log up to LSN=5710046515; to recover: 271631 pages, EXTEND_TIMEOUT_USEC=30000000
Jul 17 17:37:35 bark mariadbd[136468]: 2025-07-17 17:37:35 0 [Note] InnoDB: Multi-batch recovery needed at LSN 6252420442
Jul 17 17:37:38 bark mariadbd[136468]: 2025-07-17 17:37:38 0 [Note] InnoDB: End of log at LSN=7738417774
Jul 17 17:37:38 bark mariadbd[136468]: 2025-07-17 17:37:38 0 [Note] InnoDB: To recover: LSN 6252420442/7738417774; 298499 pages
Jul 17 17:37:38 bark systemd[1]: mariadb.service: Got notification message from PID 136468: STATUS=To recover: LSN 6252420442/7738417774; 298499 pages, EXTEND_TIMEOUT_USEC=30000000
Jul 17 17:37:53 bark mariadbd[136468]: 2025-07-17 17:37:53 0 [Note] InnoDB: To recover: LSN 6252420442/7738417774; 221234 pages
Jul 17 17:37:53 bark systemd[1]: mariadb.service: Got notification message from PID 136468: STATUS=To recover: LSN 6252420442/7738417774; 221234 pages, EXTEND_TIMEOUT_USEC=30000000
Jul 17 17:38:12 bark mariadbd[136468]: 2025-07-17 17:38:12 0 [Note] InnoDB: Parsed redo log up to LSN=6254520711; to recover: 113 pages
Jul 17 17:38:12 bark systemd[1]: mariadb.service: Got notification message from PID 136468: STATUS=Parsed redo log up to LSN=6254520711; to recover: 113 pages, EXTEND_TIMEOUT_USEC=30000000
Jul 17 17:38:35 bark mariadbd[136468]: 2025-07-17 17:38:35 0 [Note] InnoDB: 1 transaction(s) which must be rolled back or cleaned up in total 119437621 row operations to undo
Jul 17 17:38:35 bark mariadbd[136468]: 2025-07-17 17:38:35 0 [Note] InnoDB: Trx id counter is 65
Jul 17 17:38:35 bark mariadbd[136468]: 2025-07-17 17:38:35 0 [Note] InnoDB: To recover: 49476 pages
Jul 17 17:38:35 bark systemd[1]: mariadb.service: Got notification message from PID 136468: STATUS=To recover: 49476 pages, EXTEND_TIMEOUT_USEC=30000000
Jul 17 17:38:36 bark mariadbd[136468]: 2025-07-17 17:38:36 0 [Note] InnoDB: 128 rollback segments are active.
Jul 17 17:38:36 bark mariadbd[136468]: 2025-07-17 17:38:36 0 [Note] InnoDB: Starting in background the rollback of recovered transactions
Jul 17 17:38:36 bark mariadbd[136468]: 2025-07-17 17:38:36 0 [Note] InnoDB: Removed temporary tablespace data file: "./ibtmp1"
Jul 17 17:38:36 bark mariadbd[136468]: 2025-07-17 17:38:36 0 [Note] InnoDB: Setting file './ibtmp1' size to 12.000MiB. Physically writing the file full; Please wait ...
Jul 17 17:38:36 bark mariadbd[136468]: 2025-07-17 17:38:36 0 [Note] InnoDB: File './ibtmp1' size is now 12.000MiB.
Jul 17 17:38:36 bark mariadbd[136468]: 2025-07-17 17:38:36 0 [Note] InnoDB: log sequence number 7738417774; transaction id 66
Jul 17 17:38:36 bark mariadbd[136468]: 2025-07-17 17:38:36 0 [Note] InnoDB: Loading buffer pool(s) from /var/lib/mysql/ib_buffer_pool
Jul 17 17:38:36 bark mariadbd[136468]: 2025-07-17 17:38:36 0 [Note] InnoDB: Buffer pool(s) load completed at 250717 17:38:36
Jul 17 17:38:36 bark mariadbd[136468]: 2025-07-17 17:38:36 0 [Note] Plugin 'FEEDBACK' is disabled.
Jul 17 17:38:36 bark mariadbd[136468]: 2025-07-17 17:38:36 0 [Warning] Failed to lock memory. Errno: 12
Jul 17 17:38:36 bark mariadbd[136468]: 2025-07-17 17:38:36 0 [Note] Server socket created on IP: '0.0.0.0'.
Jul 17 17:38:36 bark mariadbd[136468]: 2025-07-17 17:38:36 0 [Note] Server socket created on IP: '::'.
Jul 17 17:38:36 bark mariadbd[136468]: 2025-07-17 17:38:36 0 [Note] /usr/libexec/mariadbd: ready for connections.
Jul 17 17:38:36 bark mariadbd[136468]: Version: '10.11.11-MariaDB'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MariaDB Server
Jul 17 17:38:36 bark systemd[1]: mariadb.service: Got notification message from PID 136468: READY=1, STATUS=Taking your SQL requests now...
Jul 17 17:38:50 bark mariadbd[136468]: 2025-07-17 17:38:50 0 [Note] InnoDB: To roll back: 1 transactions, 113703154 rows
Jul 17 17:38:50 bark systemd[1]: mariadb.service: Got notification message from PID 136468: STATUS=To roll back: 1 transactions, 113703154 rows, EXTEND_TIMEOUT_USEC=30000000
Jul 17 17:39:05 bark mariadbd[136468]: 2025-07-17 17:39:05 0 [Note] InnoDB: To roll back: 1 transactions, 106535991 rows
Jul 17 17:39:05 bark systemd[1]: mariadb.service: Got notification message from PID 136468: STATUS=To roll back: 1 transactions, 106535991 rows, EXTEND_TIMEOUT_USEC=30000000
Jul 17 17:39:20 bark mariadbd[136468]: 2025-07-17 17:39:20 0 [Note] InnoDB: To roll back: 1 transactions, 99706467 rows
Jul 17 17:39:20 bark systemd[1]: mariadb.service: Got notification message from PID 136468: STATUS=To roll back: 1 transactions, 99706467 rows, EXTEND_TIMEOUT_USEC=30000000
Jul 17 17:39:35 bark mariadbd[136468]: 2025-07-17 17:39:35 0 [Note] InnoDB: To roll back: 1 transactions, 93519818 rows
Jul 17 17:39:35 bark systemd[1]: mariadb.service: Got notification message from PID 136468: STATUS=To roll back: 1 transactions, 93519818 rows, EXTEND_TIMEOUT_USEC=30000000
Jul 17 17:39:50 bark mariadbd[136468]: 2025-07-17 17:39:50 0 [Note] InnoDB: To roll back: 1 transactions, 88048935 rows
Jul 17 17:39:50 bark systemd[1]: mariadb.service: Got notification message from PID 136468: STATUS=To roll back: 1 transactions, 88048935 rows, EXTEND_TIMEOUT_USEC=30000000
Jul 17 17:40:05 bark mariadbd[136468]: 2025-07-17 17:40:05 0 [Note] InnoDB: To roll back: 1 transactions, 82410278 rows
Jul 17 17:40:05 bark systemd[1]: mariadb.service: Got notification message from PID 136468: STATUS=To roll back: 1 transactions, 82410278 rows, EXTEND_TIMEOUT_USEC=30000000
Jul 17 17:40:20 bark mariadbd[136468]: 2025-07-17 17:40:20 0 [Note] InnoDB: To roll back: 1 transactions, 76985567 rows
Jul 17 17:40:20 bark systemd[1]: mariadb.service: Got notification message from PID 136468: STATUS=To roll back: 1 transactions, 76985567 rows, EXTEND_TIMEOUT_USEC=30000000
Jul 17 17:40:35 bark mariadbd[136468]: 2025-07-17 17:40:35 0 [Note] InnoDB: To roll back: 1 transactions, 71338601 rows
Jul 17 17:40:35 bark systemd[1]: mariadb.service: Got notification message from PID 136468: STATUS=To roll back: 1 transactions, 71338601 rows, EXTEND_TIMEOUT_USEC=30000000
Jul 17 17:40:44 bark systemd[1]: mariadb.service: Changed dead -> running
Jul 17 17:40:50 bark mariadbd[136468]: 2025-07-17 17:40:50 0 [Note] InnoDB: To roll back: 1 transactions, 65634159 rows
Jul 17 17:40:50 bark systemd[1]: mariadb.service: Got notification message from PID 136468: STATUS=To roll back: 1 transactions, 65634159 rows, EXTEND_TIMEOUT_USEC=30000000
Jul 17 17:41:05 bark mariadbd[136468]: 2025-07-17 17:41:05 0 [Note] InnoDB: To roll back: 1 transactions, 60100155 rows
Jul 17 17:41:05 bark systemd[1]: mariadb.service: Got notification message from PID 136468: STATUS=To roll back: 1 transactions, 60100155 rows, EXTEND_TIMEOUT_USEC=30000000
Jul 17 17:41:20 bark mariadbd[136468]: 2025-07-17 17:41:20 0 [Note] InnoDB: To roll back: 1 transactions, 54421067 rows
Jul 17 17:41:20 bark systemd[1]: mariadb.service: Got notification message from PID 136468: STATUS=To roll back: 1 transactions, 54421067 rows, EXTEND_TIMEOUT_USEC=30000000
Jul 17 17:41:35 bark mariadbd[136468]: 2025-07-17 17:41:35 0 [Note] InnoDB: To roll back: 1 transactions, 48713817 rows
Jul 17 17:41:35 bark systemd[1]: mariadb.service: Got notification message from PID 136468: STATUS=To roll back: 1 transactions, 48713817 rows, EXTEND_TIMEOUT_USEC=30000000
Jul 17 17:41:50 bark mariadbd[136468]: 2025-07-17 17:41:50 0 [Note] InnoDB: To roll back: 1 transactions, 42858084 rows
Jul 17 17:41:50 bark systemd[1]: mariadb.service: Got notification message from PID 136468: STATUS=To roll back: 1 transactions, 42858084 rows, EXTEND_TIMEOUT_USEC=30000000
Jul 17 17:42:05 bark mariadbd[136468]: 2025-07-17 17:42:05 0 [Note] InnoDB: To roll back: 1 transactions, 37420281 rows
Jul 17 17:42:05 bark systemd[1]: mariadb.service: Got notification message from PID 136468: STATUS=To roll back: 1 transactions, 37420281 rows, EXTEND_TIMEOUT_USEC=30000000
Jul 17 17:42:20 bark mariadbd[136468]: 2025-07-17 17:42:20 0 [Note] InnoDB: To roll back: 1 transactions, 31788475 rows
Jul 17 17:42:20 bark systemd[1]: mariadb.service: Got notification message from PID 136468: STATUS=To roll back: 1 transactions, 31788475 rows, EXTEND_TIMEOUT_USEC=30000000
Jul 17 17:42:35 bark mariadbd[136468]: 2025-07-17 17:42:35 0 [Note] InnoDB: To roll back: 1 transactions, 25956272 rows
Jul 17 17:42:35 bark systemd[1]: mariadb.service: Got notification message from PID 136468: STATUS=To roll back: 1 transactions, 25956272 rows, EXTEND_TIMEOUT_USEC=30000000
Jul 17 17:42:50 bark mariadbd[136468]: 2025-07-17 17:42:50 0 [Note] InnoDB: To roll back: 1 transactions, 19985067 rows
Jul 17 17:42:50 bark systemd[1]: mariadb.service: Got notification message from PID 136468: STATUS=To roll back: 1 transactions, 19985067 rows, EXTEND_TIMEOUT_USEC=30000000
Jul 17 17:43:05 bark mariadbd[136468]: 2025-07-17 17:43:05 0 [Note] InnoDB: To roll back: 1 transactions, 14195398 rows
Jul 17 17:43:05 bark systemd[1]: mariadb.service: Got notification message from PID 136468: STATUS=To roll back: 1 transactions, 14195398 rows, EXTEND_TIMEOUT_USEC=30000000
Jul 17 17:43:20 bark mariadbd[136468]: 2025-07-17 17:43:20 0 [Note] InnoDB: To roll back: 1 transactions, 8476108 rows
Jul 17 17:43:20 bark systemd[1]: mariadb.service: Got notification message from PID 136468: STATUS=To roll back: 1 transactions, 8476108 rows, EXTEND_TIMEOUT_USEC=30000000
Jul 17 17:43:35 bark mariadbd[136468]: 2025-07-17 17:43:35 0 [Note] InnoDB: To roll back: 1 transactions, 2454358 rows
Jul 17 17:43:35 bark systemd[1]: mariadb.service: Got notification message from PID 136468: STATUS=To roll back: 1 transactions, 2454358 rows, EXTEND_TIMEOUT_USEC=30000000
Jul 17 17:43:41 bark mariadbd[136468]: 2025-07-17 17:43:41 0 [Note] InnoDB: Rolled back recovered transaction 38
Jul 17 17:43:41 bark mariadbd[136468]: 2025-07-17 17:43:41 0 [Note] InnoDB: Rollback of non-prepared transactions completed

Every 15-18 seconds during recovery there was a message to extend the timeout by 30 seconds, until the service was declared READY=1 where the startup phase was officially over.

After the startup the recovery was sending 30 second extensions every 15 seconds during runtime for transaction rollback. Should a user configure RuntimeMaxSec= something by default this extends this to complete the rollback. This also could occur into shutdown in the rolback is continuing.

@dr-m, I'm sure you looked when you set these before, but are you happy with these intervals?

This was a recovery from a sigkill while inserting on a non-empty table on 10G buffer pool/log file size:

MariaDB [test]> insert into t select seq from seq_1_to_300000000;
ERROR 2013 (HY000): Lost connection to server during query

@dr-m
Copy link
Contributor

dr-m commented Jul 18, 2025

After the startup the recovery was sending 30 second extensions every 15 seconds during runtime for transaction rollback. Should a user configure RuntimeMaxSec= something by default this extends this to complete the rollback. This also could occur into shutdown in the rolback is continuing.

@dr-m, I'm sure you looked when you set these before, but are you happy with these intervals?

It generally looks OK to me. By default, InnoDB indeed rolls back all incomplete transactions as part of a shutdown. It would be interesting to repeat this experiment with innodb_fast_shutdown=3, which should allow the shutdown leave behind incomplete transactions. In that case, I hope that the timeout would not be extended too much.

Recently, I noticed that the progress reporting for innodb_fast_shutdown=0 (completing the purge of history) is missing. The server would seemingly hang for up to several minutes. This can be reproduced by running Sysbench oltp_update_index for a few minutes and executing the following:

SET GLOBAL innodb_fast_shutdown=0;
SHUTDOWN;

@CLAassistant
Copy link

CLAassistant commented Jul 19, 2025

CLA assistant check
All committers have signed the CLA.

@grooverdan
Copy link
Member

InnoDB changes to handle timeout effectively made part of https://jira.mariadb.org/browse/MDEV-37283.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
External Contribution All PRs from entities outside of MariaDB Foundation, Corporation, Codership agreements.
Development

Successfully merging this pull request may close these issues.

6 participants