MDEV-36009: Systemd: Restart on OOM #4182
base: 10.6
Conversation
Per systemd/systemd#36529, OOM counts as an on-abnormal condition. To ensure that MariaDB restarts on OOM, Restart is changed to on-abnormal, which is an extension of the current on-abort condition.
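As a rough illustration of the difference between the two settings, the sketch below encodes the restart conditions listed for on-abort and on-abnormal in the table in systemd.service(5). The "oom-kill" cause and its placement under on-abnormal follow the description above (systemd/systemd#36529); the cause labels and the helper name `restarts` are made up for this sketch and are not systemd identifiers.

```python
# Exit causes that trigger a restart for the two Restart= settings discussed.
# "unclean-signal", "timeout" and "watchdog" follow the table in
# systemd.service(5); "oom-kill" under on-abnormal follows the PR description
# (systemd/systemd#36529). The labels themselves are hypothetical.
RESTART_ON = {
    "on-abort": {"unclean-signal"},
    "on-abnormal": {"unclean-signal", "timeout", "watchdog", "oom-kill"},
}


def restarts(policy: str, cause: str) -> bool:
    """Return True if a service with this Restart= policy restarts on cause."""
    return cause in RESTART_ON[policy]


if __name__ == "__main__":
    for cause in ("unclean-signal", "timeout", "watchdog", "oom-kill"):
        print(cause, restarts("on-abort", cause), restarts("on-abnormal", cause))
```

Everything that restarts under on-abort still restarts under on-abnormal; the new cases are timeout, watchdog, and OOM.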
What does "timeout" mean in the MariaDB case? |
A timeout occurs when the start or stop time exceeds 900 seconds (15 min) plus any extension requested by the extend-timeout messages sent (usually 60 seconds each). A long crash recovery should therefore normally be OK, because the server sends extend-timeout messages periodically; however, severely stalled hardware or CPU could cause a mismatch between the timeout extensions MariaDB advertises and its ability to make progress in that time. Galera joiner/donor SST threads also rely on extending the timeout (though, looking at their implementation, I'm a bit concerned that it runs too close to the limit). What also surprises me here is that stop timeouts are included in the conditions for a restart. The shutdown timeout is extended during the InnoDB buffer pool dump (every 1k pages), InnoDB flushing at shutdown, InnoDB redo log writing during shutdown, waiting for InnoDB transactions to finish, encryption thread termination, page cleaner thread termination, and change buffer merge. With this many extensions, I doubt a shutdown timeout could occur. |
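For readers unfamiliar with the mechanism, the extend-timeout messages mentioned above are sd_notify(3) datagrams of the form EXTEND_TIMEOUT_USEC=<microseconds> sent to the socket named in $NOTIFY_SOCKET. The following is a minimal sketch of that protocol, not MariaDB's code (the server does this in C); the function names are hypothetical.

```python
import os
import socket


def extend_timeout_message(seconds: float) -> bytes:
    """Build the sd_notify payload asking systemd for more time."""
    return f"EXTEND_TIMEOUT_USEC={int(seconds * 1_000_000)}".encode()


def notify(message: bytes) -> bool:
    """Send one datagram to systemd; return False when not run under systemd."""
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    if path.startswith("@"):
        # A leading "@" denotes a Linux abstract-namespace socket.
        path = "\0" + path[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message, path)
    return True


if __name__ == "__main__":
    # Ask for 60 more seconds, the usual extension mentioned above.
    print(notify(extend_timeout_message(60)))
```

A service has to send the next message before the requested interval elapses, which is why the messages are sent periodically during long operations.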
Maybe we should increase the timeout to be more certain it won't happen? Restarting on a startup timeout sounds like a sure way to create an infinite restart loop. |
I rechecked Galera SST: it's all good there. Summary: There's no real gain in extending the start or stop timeout. InnoDB recovery already extends the timeouts predictably. It's more important that this extension is covered correctly than to increase the 15-minute start timeout. Long version: This is what a startup looks like with a large InnoDB crash recovery where the start/stop timeout was reduced to 40 seconds:
Every 15-18 seconds during recovery there was a message extending the timeout by 30 seconds, until the service was declared READY=1, at which point the startup phase was officially over. After startup, the recovery kept sending 30-second extensions every 15 seconds at runtime for the transaction rollback. Should a user configure RuntimeMaxSec= to some value, these messages by default extend it so that the rollback can complete. This could also carry over into shutdown if the rollback is still continuing. @dr-m, I'm sure you looked at this when you set these before, but are you happy with these intervals? This was a recovery from a SIGKILL while inserting into a non-empty table with a 10G buffer pool/log file size:
|
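To make the arithmetic behind those intervals concrete, here is a simplified model of how a configured timeout and the extension messages interact, based on the EXTEND_TIMEOUT_USEC description in sd_notify(3): a timeout fires only once the phase has run past the configured limit and no new message arrived within the last requested interval. The 40-second limit and the 30-second extensions every 15 seconds come from the experiment above; the 100-second recovery duration and the function name are assumptions for illustration, and this is not systemd's actual implementation.

```python
def phase_times_out(base_timeout, extensions, finish_at):
    """Simplified model of a start/stop phase under EXTEND_TIMEOUT_USEC.

    base_timeout: configured limit in seconds (e.g. TimeoutStartSec=).
    extensions:   iterable of (sent_at, extend_by) pairs in seconds.
    finish_at:    time at which the phase would complete on its own.
    """
    deadline = base_timeout
    for sent_at, extend_by in sorted(extensions):
        if sent_at > deadline or sent_at >= finish_at:
            break  # message arrives too late, or the phase is already done
        # Each message replaces the deadline, but never below the base limit.
        deadline = max(base_timeout, sent_at + extend_by)
    return finish_at > deadline


if __name__ == "__main__":
    steady = [(t, 30) for t in range(15, 100, 15)]  # 30 s every 15 s
    print(phase_times_out(40, steady, 100))  # healthy recovery
    print(phase_times_out(40, [], 100))      # no extensions sent
    print(phase_times_out(40, steady[:2], 100))  # server stalls after 30 s
```

In this model a steady stream of extensions keeps the phase alive indefinitely, while a stall longer than the last requested interval leads to a timeout, which matches the concern about stalled hardware raised earlier in the thread.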
It generally looks OK to me. By default, InnoDB indeed rolls back all incomplete transactions as part of a shutdown. It would be interesting to repeat this experiment with
SET GLOBAL innodb_fast_shutdown=0;
SHUTDOWN;
Recently, I noticed that the progress reporting for |
The InnoDB changes to handle timeouts effectively were made part of https://jira.mariadb.org/browse/MDEV-37283. |
Description
The previous PR was ported only as far back as 10.11. This PR ports it to 10.6, which is the version in Ubuntu Jammy.