这是indexloc提供的服务,不要输入任何密码
Skip to content

MDEV-37116 Replication should remain async until the primary receives rpl_semi_sync_master_wait_for_slave_count no. of ACKs #4171

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 1 commit into
base: mdev-18983
Choose a base branch
from

Conversation

ParadoxV5
Copy link
Contributor

@ParadoxV5 ParadoxV5 commented Jul 8, 2025

Description

rpl_semi_sync_master_wait_for_slave_count from MDEV-18983 only applied when Semi-Sync is active, but not to the condition to automatically return to Semi-Sync after it fell back to Async (due to timeout or rpl_semi_sync_master_wait_no_slave=0).

This commit expands the Semi-Sync transaction queue – specifically, its ACKs counter from MDEV-18983 – to be also used during Async.

  • When Semi-Sync automatically falls back to Async, it no longer clears the queue, only lets the queued transactions finish.
  • Async transactions also enqueue, but without a waiting thread.
  • The condition to return from Async to Semi-Sync (and dequeue past Async transactions) is now inside the …_wait_for_slave_count check.

Release Notes

Amend MDEV-18983 (#4037)’s notes.

How can this PR be tested?

I’ve updated rpl.rpl_semi_sync_master_wait_for_slave_count to cover the issue.

Basing the PR against the correct MariaDB version

PR quality check

  • I checked the CODING_STANDARDS.md file and my PR conforms to this where appropriate.
  • For any trivial modifications to the PR, I am ok with the reviewer making the changes themselves.

… `rpl_semi_sync_master_wait_for_slave_count` no. of ACKs

`rpl_semi_sync_master_wait_for_slave_count` from MDEV-18983
only applied when Semi-Sync is active, but not to the condition
to automatically return to Semi-Sync after it fell back to Async
(due to timeout or `rpl_semi_sync_master_wait_no_slave=0`).

This commit expands the Semi-Sync transaction queue – specifically,
its ACKs counter from MDEV-18983 – to be also used during Async.
* When Semi-Sync automatically falls back to Async, it no longer
  clears the queue, only lets the queued transactions finish.
* Async transactions also enqueue, but without a waiting thread.
* The condition to return from Async to Semi-Sync (and dequeue past
  Async transactions) is now inside the `…_wait_for_slave_count` check.

Reviewed-by: Brandon Nesterenko <brandon.nesterenko@mariadb.com>
@ParadoxV5 ParadoxV5 added MariaDB Corporation Replication Patches involved in replication labels Jul 8, 2025
Copy link
Contributor Author

@ParadoxV5 ParadoxV5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No CI in PRs not for main or X.Y? 😞

Comment on lines +1308 to +1314
DBUG_ASSERT(m_active_tranxs != NULL);
/*
Add a node even if semi-sync is temporarily down to track turning back
on when there are `rpl_semi_sync_master_wait_for_slave_count` acks
*/
if (m_active_tranxs->insert_tranx_node(is_on() ? thd : nullptr,
log_file_name, log_file_pos))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, I’m concerned that this natural solution leads to an indefinite increase in memory usage as long as the Async is indefinite (due to a lack of monitoring, I suppose).

Copy link
Contributor

@bnestere bnestere Jul 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great catch! This should definitely have some safety. Eg. when adding a new node, if there are more than 64 (say?) nodes already in the list, then we clear the node at the front of the queue so there is some max number (where this condition only applies when semi-sync is off). And we should probably output a warning once we enter the state where we start deleting nodes from this list (though likely not a warning for every deleted node, just that we've entered into that state).

Some optimization could be to clear nodes in batches too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don’t like dropping records.
This 64 should instead be a user-configurable queue limit that disables Semi-Sync when hit.


The feature and, by extension, this bug must distinguish ACKs between different replicas.
The feature relied on that ACKs from the same replica are always for different transactions.

An alternative to indexing by transactions is to index by replica and track their reply(s); for example, a set or a boolean map to solve the bug.

The feature could’ve used this solution as well.
It is reminiscent of MySQL’s implementation, and its performance likely scales better for MDEV-33491 Semi-sync Replication Group ACK: by the number of replicas rather than transactions in the group.
It’s resistant to double-send too (if that’s ever a concern).


Speaking of Group ACK, it reminded me of a concern about the feature and, by extension, this bug too.
Suppose there are two concurrent waiting transactions, A & B.
What should happen if the spotty network drops A’s ACK, but B’s makes it?

If there are …_wait_for_slave_count is 2 (and there are at least 2 replicas), if

  • For one replica, spotty network drops A’s ACK, but B’s makes it
  • Another replica replies A, but is busy with B

I’m confident that I’ve left my implementation to not count A as having 2 ACKs.
Is this “right”?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An alternative to indexing by transactions is to index by replica and track their reply(s); for example, a set or a boolean map to solve the bug.

I don't quite grasp how this would work. Can you give an example?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the context where that statement was, it means to use a set to track which replicas have ACKed during the fallback to Async.
In contrast, the current draft relies on the transaction position to identify ACKs (of different transactions) from the same replica.

In the context of the alternate feature implementation, it suggests repurposing it for Semi-Sync as well: change the set to a map from replicas to their ACKed positions, and replace Tranx_node::acks entirely.

Example with transactions A & B and replicas X & Y

  1. Start
    • Current: {"A": 0, "B": 0}
    • Proposed Fix: []
    • Proposed Feature: {"X": null, "Y": null}
  2. Receive ACK A from X
    • Current: {"A": 1, "B": 0}
    • Proposed Fix: ["X"]
    • Proposed Feature: {"X": "A", "Y": null}
  3. Receive ACK B from X
    • Current: {"A": 1, "B": 1}
    • Proposed Fix: ["X"] (It does not care about specific transactions.)
    • Proposed Feature: {"X": "B", "Y": null} (It only cares about the farthest position.)
  4. Receive ACK B from Y (ACK A from X rolled one in a million and was lost on the internet)
    • Current: {"A": 1, "B": 2}
      • With B as a Group ACK: {"A": 2, "B": 2}
    • Proposed Fix: ["X", "Y"]
    • Proposed Feature: {"X": "B", "Y": "B"}
  5. Continue Semi-Sync if …_wait_for_slave_count=2
    • Current: The value of the most recently ACKed transaction (B) is 2.
    • Proposed Fix: The size of the set is 2.
    • Proposed Feature: The number of values at or after the most recently ACKed transaction (B) is 2.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd agree with the idea you call "Proposed Feature" - in fact, we have a PR which already provides the base for that behavior (PR #3288). It intends to add new fields to SHOW REPLICA HOSTS when run on the master, to also report on Gtid_Sent (to report the last GTID that the dump thread has sent to this replica); Gtid_Ack (to report the last GTID that was ACKed by that replica); and Sync_Status to report the idle/active state of the replica. Note that the GTID isn't actually what is saved, it is binlog file + pos, but then during SHOW REPLICA HOSTS it is converted to GTID.

It was temporarily abandoned because it added extra overhead in locking to save the binlog coordinates (particularly for Gtid_Sent) -- though we don't actually need that here, and Gtid_Ack is really the important one anyway, which is what would be used here. I wonder if we could re-purpose that patch here and simply drop the Gtid_Sent column to minimize overhead.

@knielsen you initially voiced the concerns for that PR; do you have any thoughts about re-using it here for more efficiently tracking ACKs from multiple slaves (and I'd think it would be good to keep the SHOW REPLICA HOSTS extensions as well now that transactions need multiple ACKs). And BTW, this will be a part of MDEV-18983, we are removing that feature from 12.2 and pushing it out to a later release to account for this MDEV-37116 finding/re-design

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting.
I was thinking about adding the field to struct Slave of semisync_master_ack_receiver.h.
But now I see that repl_failsafe.cc has another Slave_info… where the struct Slave things could belong if Slave_info wasn’t private to the file.


{
if (!is_on()) // We check to see whether we can switch semi-sync ON ...
try_switch_on(server_id, log_file_name, log_file_pos);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you describe how m_reply_file_*, m_wait_file_* and m_commit_file_* change in Semi-Sync, @bnestere san?

I sense that, with …_wait_for_slave_count, at least one of them is now obsolete.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think they are still relevant.

m_commit_file_* - The latest binlog position that a user transaction has written into the binlog. I think this should generally correspond to whatever the latest transaction written into the binlog (as long as semi sync is enabled). It is used to help determine whether or not the binlog_dump_thread should request an ack from the slave or not (need_sync in Rpl_semi_sync_master::update_sync_header()). It is updated while writing to the binlog (at the same point when the node is inserted into the Active_tranx queue).

m_wait_file_* - The earliest transaction that is still being waited on for an ACK. It is also used to help determine whether or not the binlog_dump_thread should request an ack from the slave or not. It is updated as a user transaction begins its semi-sync wait.

m_reply_file_* - The latest transaction that has been signalled to continue. This is used to determine whether or not a transaction needs to wait (i.e. if some future transaction has already been ACKed before this one goes to wait, then no wait is necessary). It is updated by the Ack thread after receiving a reply (though I think this changes in your patch to update after receiving all replies).

So I think the only thing that changes is when m_reply_file_* is updated- i.e., only after all ACKs have been received. It'd be good to update the comment for the variables in semisync_master.h to reflect that (I don't remember if your patch already does that or not).

Copy link
Contributor Author

@ParadoxV5 ParadoxV5 Jul 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So,

  • For a completed Semi-Sync,

    • m_reply_file_* = m_commit_file_* if successful, or < if it fell back to Async
    • m_wait_file_* is null
  • For an ongoing Semi-Sync,

    • m_reply_file_* < m_wait_file_*m_commit_file_*
    • m_commit_file_* is the last entry in the queue ordered list, as they’re updated at the same time.
      • This is why I feel that, benchmarking aside, “is before m_commit_file_*” is equivalent to is_tranx_end_pos(), especially if Async transactions are also in the list.
    • m_wait_file_* could be the first entry in the list, as entries earlier than it are obsolete; though I see in the code that it only decreases.

Copy link
Contributor

@bnestere bnestere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @ParadoxV5 !

Thanks for the PR! The only main thing to address is the memory leak you'd already brought out. The rest is pretty minor.

A couple other things:

  1. Can you extend the first paragraph of the git commit message to also explain how semi-sync is reactivated? This will give the full picture of the inconsistency.
  2. Can you extend the second paragraph of the git commit message's first sentence to explain why the queue will be used during async? E.g., "to be also used during async in order to...."
  3. When actually merging 18983 into the 12.1 branch, you can either squash this commit in with the overall 18983 work or keep it as a separate commit. If you squash, please update the 18983 commit message with a reference to this JIRA, as well as a high-level description of what is required to switch from async back to semi-sync.

@@ -361,14 +361,24 @@ class Active_tranx
int insert_tranx_node(THD *thd_to_wait, const char *log_file_name,
my_off_t log_file_pos);

/**
Commit the active transaction nodes until (exclusive) the specified node. If
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The term "commit" here is a bit misleading, perhaps "Signal the active transaction nodes to complete" would be more accurate?

Commit the active transaction nodes until (exclusive) the specified node. If
it's not in the collection (e.g., is `nullptr`), everything will be flushed.

@return whether the node is *not* in the collection or is `nullptr`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think rather than a return value (which is conventionally an error status), it would read better to have this as an output variable (e.g. bool *until_node_found).


#ifndef DBUG_OFF
#ifndef DBUG_OFF
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whitespace added should be removed

Suggested change
#ifndef DBUG_OFF
#ifndef DBUG_OFF

@@ -151,6 +156,12 @@ while ($wait_no_slave < 2)
--inc $wait_no_slave
}

--echo # Case 2.4: "Stays Async" is not "fails to return to Semi-Sync"
# The number of timeouts should be 2, 1 from Case 2.1 per loop.
#TODO: MDEV-36936 Expose rpl_semi_sync_master_wait_timeouts
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rpl_semi_sync_master_no_tx doesn't suffice?

Copy link
Contributor Author

@ParadoxV5 ParadoxV5 Jul 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both Case 2.2 (does not reactivate) and 2.3 (does reactivate) try to reactivate Semi-Sync and then test with Rpl_semi_sync_master_yes/no_tx if it succeeds.
In Case 2.2, the reactivation occurred only to timeout again in the test, hence this bug.

@@ -361,14 +361,24 @@ class Active_tranx
int insert_tranx_node(THD *thd_to_wait, const char *log_file_name,
my_off_t log_file_pos);

/**
Commit the active transaction nodes until (exclusive) the specified node. If
it's not in the collection (e.g., is `nullptr`), everything will be flushed.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd also suggest some additional comments (perhaps both here and in clear_...) to make clear the nuance between the functions (flush_.. and clear_...)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
MariaDB Corporation Replication Patches involved in replication
Development

Successfully merging this pull request may close these issues.

2 participants