[Enhancement] Try to cleanup lingering transactions when restoring in exactly-once mode #271

banmoy · 2023-08-28T02:18:12Z

What type of PR is this：

Which issues does this PR fix:

Fixes #

Problem Summary(Required) ：

What's the problem

When using exactly-once, the connector does not abort PREPARED transactions when the Flink job fails over or exits, because of its 2PC mechanism. Some of those PREPARED transactions may belong to a successful checkpoint and will be committed when the job restores from that checkpoint, but others are useless and should be aborted. Otherwise they linger in StarRocks until they time out, which may make StarRocks unstable. We should try to abort those lingering transactions when restoring.

How to solve it

When the Flink job restores, the connector tries to find those lingering transactions and abort them. The key is how to find them, because their labels are not stored in the checkpoint. We design a label generator (ExactlyOnceLabelGenerator) to solve this:

- The user must set the option sink.label-prefix, which is used as the prefix of the labels. It must be unique across all ingestions running on the same StarRocks cluster, including Flink connector jobs, Broker Load, and Routine Load.
- The connector generates labels in the format {labelPrefix}-{tableName}-{subtaskIndex}-{id}.
  - subtaskIndex makes the label unique across subtasks when the sink writes in parallel.
  - id is incremental and makes the label unique across transactions within a subtask.
- When checkpointing, the current id is stored as state in the checkpoint. Labels whose ids are less than the current id must be successful; only labels whose ids are equal to or larger than the current id can be lingering.
- When restoring, the connector reads the current id from the checkpoint, constructs the label with that id, and gets the label status from StarRocks. If the transaction is in the PREPARED state, it is lingering and should be aborted.
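The label scheme and the restore-time cleanup described above can be sketched roughly as follows. This is a minimal illustration, not the connector's actual ExactlyOnceLabelGenerator: the class names, the `TxnClientSketch` interface, the status strings other than PREPARED, and the rule of stopping the scan at the first label that is not PREPARED are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the label scheme; not the connector's ExactlyOnceLabelGenerator. */
final class LabelGeneratorSketch {
    private final String labelPrefix;
    private final String tableName;
    private final int subtaskIndex;
    private long nextId; // restored from checkpoint state, or 0 on a fresh start

    LabelGeneratorSketch(String labelPrefix, String tableName, int subtaskIndex, long restoredId) {
        this.labelPrefix = labelPrefix;
        this.tableName = tableName;
        this.subtaskIndex = subtaskIndex;
        this.nextId = restoredId;
    }

    /** Builds {labelPrefix}-{tableName}-{subtaskIndex}-{id}. */
    static String format(String labelPrefix, String tableName, int subtaskIndex, long id) {
        return labelPrefix + "-" + tableName + "-" + subtaskIndex + "-" + id;
    }

    /** Returns the label for the next transaction and advances the id. */
    String next() {
        return format(labelPrefix, tableName, subtaskIndex, nextId++);
    }

    /** Value to store in the checkpoint (assumed here to be the first unused id). */
    long snapshot() {
        return nextId;
    }
}

/** Assumed stand-in for the StarRocks label-status and abort calls. */
interface TxnClientSketch {
    String status(String label); // e.g. "PREPARED"; "UNKNOWN" if the label was never used
    void abort(String label);
}

/** In-memory fake so the sketch runs without a StarRocks cluster. */
final class InMemoryTxnClient implements TxnClientSketch {
    private final Map<String, String> states = new HashMap<>();

    void put(String label, String state) {
        states.put(label, state);
    }

    @Override
    public String status(String label) {
        return states.getOrDefault(label, "UNKNOWN");
    }

    @Override
    public void abort(String label) {
        states.put(label, "ABORTED");
    }
}

final class LingeringCleanerSketch {
    /** Aborts PREPARED labels starting at the restored id; stops at the first non-PREPARED label (assumption). */
    static List<String> abortLingering(TxnClientSketch client, String labelPrefix,
                                       String tableName, int subtaskIndex, long restoredId) {
        List<String> aborted = new ArrayList<>();
        for (long id = restoredId; ; id++) {
            String label = LabelGeneratorSketch.format(labelPrefix, tableName, subtaskIndex, id);
            if (!"PREPARED".equals(client.status(label))) {
                break;
            }
            client.abort(label);
            aborted.add(label);
        }
        return aborted;
    }
}
```

Because every label is derived from the prefix, table, subtask index, and id, the restored id alone is enough to reconstruct candidate labels, which is why the labels themselves do not need to be stored in the checkpoint.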

Checklist:

I have added test cases for my bug fix or my new feature
This pr will affect users' behaviors
This pr needs user documentation (for new or modified features or behaviors)
I have added documentation for my new feature or new function

dyp12 · 2023-08-29T03:38:32Z

This PREPARED transaction exists because the import encountered an exception. So when the import encounters an exception, should the PREPARED transaction be closed and then restarted? Also, each task needs its own sink.label-prefix, which makes duplicates easy to introduce.

banmoy · 2023-08-29T09:53:54Z

@dyp12 Thanks for your comments

The transaction is set to PREPARED because a checkpoint is triggered and we need to prepare it; see StarRocksDynamicSinkFunctionV2#snapshotState(). It is finally committed when StarRocksDynamicSinkFunctionV2#notifyCheckpointComplete() is called, which indicates that the Flink checkpoint succeeded globally. This is Flink's two-phase-commit mechanism for implementing exactly-once. Before notifyCheckpointComplete, we cannot abort the transaction even if an exception happens, because that may lead to data loss if the checkpoint succeeded globally but this subtask has not yet been notified.
Keeping sink.label-prefix unique is a burden for users, but there does not seem to be a better solution currently. The approach is similar to that of the Flink Kafka connector; see sink.transactional-id-prefix.
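To make the prepare/commit ordering concrete, here is a small self-contained model of the two-phase flow. It is only a sketch: the method names mirror snapshotState and notifyCheckpointComplete, but the class, its fields, and the `write` helper are hypothetical and do not reflect the real StarRocksDynamicSinkFunctionV2 code. It also assumes that a completion notice for checkpoint N covers every earlier checkpoint.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical model of a two-phase-commit sink; not the connector's implementation. */
final class TwoPhaseSinkSketch {
    // Transactions that were prepared, keyed by the checkpoint that prepared them.
    private final TreeMap<Long, List<String>> preparedByCheckpoint = new TreeMap<>();
    private final List<String> committed = new ArrayList<>();
    private String openTxn; // label of the transaction currently receiving data, or null

    /** Simulates writing data under a transaction label (illustrative helper). */
    void write(String label) {
        openTxn = label;
    }

    /** Phase 1: on checkpoint, move the open transaction to PREPARED. */
    void snapshotState(long checkpointId) {
        if (openTxn != null) {
            preparedByCheckpoint.computeIfAbsent(checkpointId, k -> new ArrayList<>()).add(openTxn);
            openTxn = null;
        }
    }

    /** Phase 2: commit everything prepared for this checkpoint or an earlier one. */
    void notifyCheckpointComplete(long checkpointId) {
        Map<Long, List<String>> ready = preparedByCheckpoint.headMap(checkpointId, true);
        for (List<String> labels : ready.values()) {
            committed.addAll(labels);
        }
        ready.clear(); // the view is backed by the map, so this removes the committed entries
    }

    /** Labels that are PREPARED but not yet committed. */
    List<String> prepared() {
        List<String> all = new ArrayList<>();
        for (List<String> labels : preparedByCheckpoint.values()) {
            all.addAll(labels);
        }
        return all;
    }

    List<String> committed() {
        return new ArrayList<>(committed);
    }
}
```

In this model, a failure between snapshotState and notifyCheckpointComplete leaves labels in `prepared()`. Those cannot simply be aborted by the failing subtask, since the checkpoint may already have succeeded globally.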


Ecthlion · 2025-06-24T15:23:30Z

Regarding "Some of those PREPARED transactions may be in a successful checkpoint, and will be committed when the job restores from the checkpoint": if the FE is restored while this txn is PREPARED, the callback instance in the FE is lost and the txn will never be committed, but it also cannot be rolled back by the connector because of exactly-once. How can this problem be solved? Is this a bug in StarRocks or in the connector?

banmoy force-pushed the label_generator branch from 7ebbde1 to fd46059 Compare August 28, 2023 13:40

banmoy added 4 commits August 30, 2023 20:25

[Refactor] Use label generator to generate labels

11322bd

Signed-off-by: PengFei Li <lpengfei2016@gmail.com>

[Refactor] Refact transaction status

d2c8799

Signed-off-by: PengFei Li <lpengfei2016@gmail.com>

[Enhancement] Try to abort labels

a9ee213

Signed-off-by: PengFei Li <lpengfei2016@gmail.com>

[UT] Add tests

e85f108

Signed-off-by: PengFei Li <lpengfei2016@gmail.com>

banmoy force-pushed the label_generator branch from fd46059 to e85f108 Compare August 30, 2023 12:25

Set label generator factory

1709dc2

Signed-off-by: PengFei Li <lpengfei2016@gmail.com>

xlfjcg approved these changes Aug 31, 2023


banmoy merged commit 83919ff into StarRocks:main Sep 1, 2023

banmoy added a commit to banmoy/starrocks-connector-for-apache-flink that referenced this pull request Sep 2, 2023

[Enhancement] Try to cleanup lingering transactions when restoring in…

be9022d

… exactly-once mode (StarRocks#271) Signed-off-by: PengFei Li <lpengfei2016@gmail.com>

banmoy added a commit to banmoy/starrocks-connector-for-apache-flink that referenced this pull request Sep 2, 2023

[Enhancement] Try to cleanup lingering transactions when restoring in…

db32d50

… exactly-once mode (StarRocks#271) Signed-off-by: PengFei Li <lpengfei2016@gmail.com>

banmoy added a commit that referenced this pull request Sep 11, 2023

[Enhancement] Try to cleanup lingering transactions when restoring in…

137ceb7

… exactly-once mode (#271) Signed-off-by: PengFei Li <lpengfei2016@gmail.com>

banmoy added a commit that referenced this pull request Sep 11, 2023

[Enhancement] Try to cleanup lingering transactions when restoring in…

71c194f

… exactly-once mode (#271) Signed-off-by: PengFei Li <lpengfei2016@gmail.com>
