MDEV-37116 Replication should remain async until the primary receives `rpl_semi_sync_master_wait_for_slave_count` no. of ACKs #4171

ParadoxV5 · 2025-07-08T00:55:33Z

The Jira issue number for this PR is: MDEV-37116
- Follow-up to MDEV-18983

Description

rpl_semi_sync_master_wait_for_slave_count from MDEV-18983 only applied when Semi-Sync is active, but not to the condition to automatically return to Semi-Sync after it fell back to Async (due to timeout or rpl_semi_sync_master_wait_no_slave=0).

This commit expands the Semi-Sync transaction queue – specifically, its ACKs counter from MDEV-18983 – to be also used during Async.

When Semi-Sync automatically falls back to Async, it no longer clears the queue, only lets the queued transactions finish.
Async transactions also enqueue, but without a waiting thread.
The condition to return from Async to Semi-Sync (and dequeue past Async transactions) is now inside the …_wait_for_slave_count check.

Release Notes

Amend MDEV-18983 (#4037)’s notes.

How can this PR be tested?

I’ve updated rpl.rpl_semi_sync_master_wait_for_slave_count to cover the issue.

Basing the PR against the correct MariaDB version

This is a new feature or a refactoring, and the PR is based against the main branch.
This is a bug fix, and the PR is based against the earliest maintained branch in which the bug can be reproduced.
- This bug is a finding on MDEV-18983, an unreleased new feature in the middle of QA, so I created a separate branch based on that branch, MDEV-18983 Port rpl_semi_sync_master_wait_for_slave_count from MySQL #4037; okay?

PR quality check

I checked the CODING_STANDARDS.md file and my PR conforms to this where appropriate.
For any trivial modifications to the PR, I am ok with the reviewer making the changes themselves.

… `rpl_semi_sync_master_wait_for_slave_count` no. of ACKs `rpl_semi_sync_master_wait_for_slave_count` from MDEV-18983 only applied when Semi-Sync is active, but not to the condition to automatically return to Semi-Sync after it fell back to Async (due to timeout or `rpl_semi_sync_master_wait_no_slave=0`). This commit expands the Semi-Sync transaction queue – specifically, its ACKs counter from MDEV-18983 – to be also used during Async. * When Semi-Sync automatically falls back to Async, it no longer clears the queue, only lets the queued transactions finish. * Async transactions also enqueue, but without a waiting thread. * The condition to return from Async to Semi-Sync (and dequeue past Async transactions) is now inside the `…_wait_for_slave_count` check. Reviewed-by: Brandon Nesterenko <brandon.nesterenko@mariadb.com>

ParadoxV5

No CI in PRs not for main or X.Y? 😞

ParadoxV5 · 2025-07-08T01:04:58Z

sql/semisync_master.cc

+  DBUG_ASSERT(m_active_tranxs != NULL);
+  /*
+    Add a node even if semi-sync is temporarily down to track turning back
+    on when there are `rpl_semi_sync_master_wait_for_slave_count` acks
+  */
+  if (m_active_tranxs->insert_tranx_node(is_on() ? thd : nullptr,
+                                         log_file_name, log_file_pos))


However, I’m concerned that this natural solution leads to an indefinite increase in memory usage as long as the Async is indefinite (due to a lack of monitoring, I suppose).

Great catch! This should definitely have some safety. Eg. when adding a new node, if there are more than 64 (say?) nodes already in the list, then we clear the node at the front of the queue so there is some max number (where this condition only applies when semi-sync is off). And we should probably output a warning once we enter the state where we start deleting nodes from this list (though likely not a warning for every deleted node, just that we've entered into that state).

Some optimization could be to clear nodes in batches too.

I don’t like dropping records.
This 64 should instead be a user-configurable queue limit that disables Semi-Sync when hit.

The feature and, by extension, this bug must distinguish ACKs between different replicas.
The feature relied on that ACKs from the same replica are always for different transactions.

An alternative to indexing by transactions is to index by replica and track their reply(s); for example, a set or a boolean map to solve the bug.

The feature could’ve used this solution as well.
It is reminiscent of MySQL’s implementation, and its performance likely scales better for MDEV-33491 Semi-sync Replication Group ACK: by the number of replicas rather than transactions in the group.
It’s resistant to double-send too (if that’s ever a concern).

Speaking of Group ACK, it reminded me of a concern about the feature and, by extension, this bug too.
Suppose there are two concurrent waiting transactions, A & B.
What should happen if the spotty network drops A’s ACK, but B’s makes it?

If there are …_wait_for_slave_count is 2 (and there are at least 2 replicas), if

For one replica, spotty network drops A’s ACK, but B’s makes it

Another replica replies A, but is busy with B

I’m confident that I’ve left my implementation to not count A as having 2 ACKs.
Is this “right”?

If it’s bad, then its fix is a setup towards MDEV-33491 Semi-sync Replication Group ACK.

If it’s good, then note that this index-by-replica alternative has poor compatibility with this behavior.

An alternative to indexing by transactions is to index by replica and track their reply(s); for example, a set or a boolean map to solve the bug.

I don't quite grasp how this would work. Can you give an example?

In the context where that statement was, it means to use a set to track which replicas have ACKed during the fallback to Async.
In contrast, the current draft relies on the transaction position to identify ACKs (of different transactions) from the same replica.

In the context of the alternate feature implementation, it suggests repurposing it for Semi-Sync as well: change the set to a map from replicas to their ACKed positions, and replace Tranx_node::acks entirely.

Example with transactions A & B and replicas X & Y

Start

Current: {"A": 0, "B": 0}

Proposed Fix: []

Proposed Feature: {"X": null, "Y": null}

Receive ACK A from X

Current: {"A": 1, "B": 0}

Proposed Fix: ["X"]

Proposed Feature: {"X": "A", "Y": null}

Receive ACK B from X

Current: {"A": 1, "B": 1}

Proposed Fix: ["X"] (It does not care about specific transactions.)

Proposed Feature: {"X": "B", "Y": null} (It only cares about the farthest position.)

Receive ACK B from Y (ACK A from X rolled one in a million and was lost on the internet)

Current: {"A": 1, "B": 2}

With B as a Group ACK: {"A": 2, "B": 2}

Proposed Fix: ["X", "Y"]

Proposed Feature: {"X": "B", "Y": "B"}

Continue Semi-Sync if …_wait_for_slave_count=2

Current: The value of the most recently ACKed transaction (B) is 2.

Proposed Fix: The size of the set is 2.

Proposed Feature: The number of values at or after the most recently ACKed transaction (B) is 2.

I'd agree with the idea you call "Proposed Feature" - in fact, we have a PR which already provides the base for that behavior (PR #3288). It intends to add new fields to SHOW REPLICA HOSTS when run on the master, to also report on Gtid_Sent (to report the last GTID that the dump thread has sent to this replica); Gtid_Ack (to report the last GTID that was ACKed by that replica); and Sync_Status to report the idle/active state of the replica. Note that the GTID isn't actually what is saved, it is binlog file + pos, but then during SHOW REPLICA HOSTS it is converted to GTID.

It was temporarily abandoned because it added extra overhead in locking to save the binlog coordinates (particularly for Gtid_Sent) -- though we don't actually need that here, and Gtid_Ack is really the important one anyway, which is what would be used here. I wonder if we could re-purpose that patch here and simply drop the Gtid_Sent column to minimize overhead.

@knielsen you initially voiced the concerns for that PR; do you have any thoughts about re-using it here for more efficiently tracking ACKs from multiple slaves (and I'd think it would be good to keep the SHOW REPLICA HOSTS extensions as well now that transactions need multiple ACKs). And BTW, this will be a part of MDEV-18983, we are removing that feature from 12.2 and pushing it out to a later release to account for this MDEV-37116 finding/re-design

Interesting.
I was thinking about adding the field to struct Slave of semisync_master_ack_receiver.h.
But now I see that repl_failsafe.cc has another Slave_info… where the struct Slave things could belong if Slave_info wasn’t private to the file.

ParadoxV5 · 2025-07-08T01:11:59Z

sql/semisync_master.cc

-
+    {
+      if (!is_on()) // We check to see whether we can switch semi-sync ON ...
+        try_switch_on(server_id, log_file_name, log_file_pos);


Can you describe how m_reply_file_*, m_wait_file_* and m_commit_file_* change in Semi-Sync, @bnestere san?

I sense that, with …_wait_for_slave_count, at least one of them is now obsolete.

I think they are still relevant.

m_commit_file_* - The latest binlog position that a user transaction has written into the binlog. I think this should generally correspond to whatever the latest transaction written into the binlog (as long as semi sync is enabled). It is used to help determine whether or not the binlog_dump_thread should request an ack from the slave or not (need_sync in Rpl_semi_sync_master::update_sync_header()). It is updated while writing to the binlog (at the same point when the node is inserted into the Active_tranx queue).

m_wait_file_* - The earliest transaction that is still being waited on for an ACK. It is also used to help determine whether or not the binlog_dump_thread should request an ack from the slave or not. It is updated as a user transaction begins its semi-sync wait.

m_reply_file_* - The latest transaction that has been signalled to continue. This is used to determine whether or not a transaction needs to wait (i.e. if some future transaction has already been ACKed before this one goes to wait, then no wait is necessary). It is updated by the Ack thread after receiving a reply (though I think this changes in your patch to update after receiving all replies).

So I think the only thing that changes is when m_reply_file_* is updated- i.e., only after all ACKs have been received. It'd be good to update the comment for the variables in semisync_master.h to reflect that (I don't remember if your patch already does that or not).

So,

For a completed Semi-Sync,

m_reply_file_* = m_commit_file_* if successful, or < if it fell back to Async

m_wait_file_* is null

For an ongoing Semi-Sync,

m_reply_file_* < m_wait_file_* ≤ m_commit_file_*

m_commit_file_* is the last entry in the ~~queue~~ ordered list, as they’re updated at the same time.

This is why I feel that, benchmarking aside, “is before m_commit_file_*” is equivalent to is_tranx_end_pos(), especially if Async transactions are also in the list.

m_wait_file_* could be the first entry in the list, as entries earlier than it are obsolete; though I see in the code that it only decreases.

bnestere

Hi @ParadoxV5 !

Thanks for the PR! The only main thing to address is the memory leak you'd already brought out. The rest is pretty minor.

A couple other things:

Can you extend the first paragraph of the git commit message to also explain how semi-sync is reactivated? This will give the full picture of the inconsistency.
Can you extend the second paragraph of the git commit message's first sentence to explain why the queue will be used during async? E.g., "to be also used during async in order to...."
When actually merging 18983 into the 12.1 branch, you can either squash this commit in with the overall 18983 work or keep it as a separate commit. If you squash, please update the 18983 commit message with a reference to this JIRA, as well as a high-level description of what is required to switch from async back to semi-sync.

bnestere · 2025-07-08T16:24:28Z

sql/semisync_master.h

@@ -361,14 +361,24 @@ class Active_tranx
  int insert_tranx_node(THD *thd_to_wait, const char *log_file_name,
                        my_off_t log_file_pos);

+  /**
+    Commit the active transaction nodes until (exclusive) the specified node. If


The term "commit" here is a bit misleading, perhaps "Signal the active transaction nodes to complete" would be more accurate?

bnestere · 2025-07-08T16:25:14Z

sql/semisync_master.h

+    Commit the active transaction nodes until (exclusive) the specified node. If
+    it's not in the collection (e.g., is `nullptr`), everything will be flushed.
+
+    @return whether the node is *not* in the collection or is `nullptr`


I think rather than a return value (which is conventionally an error status), it would read better to have this as an output variable (e.g. bool *until_node_found).

bnestere · 2025-07-08T16:26:14Z

sql/semisync_master.cc


-#ifndef DBUG_OFF
+    #ifndef DBUG_OFF


The whitespace added should be removed

Suggested change

#ifndef DBUG_OFF

#ifndef DBUG_OFF

bnestere · 2025-07-08T16:31:36Z

mysql-test/suite/rpl/t/rpl_semi_sync_master_wait_for_slave_count.test

@@ -151,6 +156,12 @@ while ($wait_no_slave < 2)
  --inc $wait_no_slave
 }

+--echo # Case 2.4: "Stays Async" is not "fails to return to Semi-Sync"
+# The number of timeouts should be 2, 1 from Case 2.1 per loop.
+#TODO: MDEV-36936 Expose rpl_semi_sync_master_wait_timeouts


Rpl_semi_sync_master_no_tx doesn't suffice?

Both Case 2.2 (does not reactivate) and 2.3 (does reactivate) try to reactivate Semi-Sync and then test with Rpl_semi_sync_master_yes/no_tx if it succeeds.
In Case 2.2, the reactivation occurred only to timeout again in the test, hence this bug.

bnestere · 2025-07-08T16:37:57Z

sql/semisync_master.h

@@ -361,14 +361,24 @@ class Active_tranx
  int insert_tranx_node(THD *thd_to_wait, const char *log_file_name,
                        my_off_t log_file_pos);

+  /**
+    Commit the active transaction nodes until (exclusive) the specified node. If
+    it's not in the collection (e.g., is `nullptr`), everything will be flushed.


I'd also suggest some additional comments (perhaps both here and in clear_...) to make clear the nuance between the functions (flush_.. and clear_...)

ParadoxV5 requested review from bnestere and mariadb-SusilBehera July 8, 2025 00:55

ParadoxV5 added MariaDB Corporation Replication Patches involved in replication labels Jul 8, 2025

ParadoxV5 commented Jul 8, 2025

View reviewed changes

bnestere requested changes Jul 8, 2025

View reviewed changes

Uh oh!

MDEV-37116 Replication should remain async until the primary receives rpl_semi_sync_master_wait_for_slave_count no. of ACKs #4171

Are you sure you want to change the base?

MDEV-37116 Replication should remain async until the primary receives rpl_semi_sync_master_wait_for_slave_count no. of ACKs #4171

Uh oh!

Conversation

ParadoxV5 commented Jul 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Release Notes

How can this PR be tested?

Basing the PR against the correct MariaDB version

PR quality check

Uh oh!

ParadoxV5 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

bnestere Jul 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Example with transactions A & B and replicas X & Y

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ParadoxV5 Jul 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

bnestere left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ParadoxV5 Jul 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

MDEV-37116 Replication should remain async until the primary receives `rpl_semi_sync_master_wait_for_slave_count` no. of ACKs #4171

MDEV-37116 Replication should remain async until the primary receives `rpl_semi_sync_master_wait_for_slave_count` no. of ACKs #4171

ParadoxV5 commented Jul 8, 2025 •

edited

Loading

bnestere Jul 8, 2025 •

edited

Loading

ParadoxV5 Jul 8, 2025 •

edited

Loading

ParadoxV5 Jul 8, 2025 •

edited

Loading