Actual listing table for S3 orphaned objects monitoring #352
base: main

Conversation
Reviewer's Guide

This PR implements a persistent ClickHouse MergeTree table for S3 object listing with secure credential propagation, coordinates table ownership across replicas using ZooKeeper, and dynamically selects the remote engine based on port configuration.

Sequence Diagram for Conditional Final Drop of Listing Table with ZooKeeper

```mermaid
sequenceDiagram
    actor UserScript as "User/Script"
    participant CurrentCH as "Current CH Node"
    participant ZooKeeper
    Note over UserScript, CurrentCH: In 'finally' block of clean() function
    opt Not keep_paths (Cleanup requested)
        CurrentCH->>CurrentCH: effective_listing_table // This is the table name string (local or remote) defined earlier in clean()
        CurrentCH->>CurrentCH: Call _drop_table(effective_listing_table, user_password) // May target a remote owner
        alt has_zk() is true AND check_zk_node(ZK_PATH_LISTING_TABLE_INFO) is true
            CurrentCH->>ZooKeeper: delete_zk_node(ZK_PATH_LISTING_TABLE_INFO)
        end
    end
```
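To make the drop flow concrete, here is a minimal Python sketch of the finally-block logic; the helper names (_drop_table, has_zk, check_zk_node, delete_zk_node) come from the diagram above, while their exact signatures and the ctx argument are assumptions, not the PR's code.

```python
# Sketch only: mirrors the sequence diagram above; signatures are assumed.
try:
    ...  # main body of clean()
finally:
    if not keep_paths:
        # effective_listing_table may be a local table name or a remote() clause
        # pointing at the owning replica's table.
        _drop_table(ctx, effective_listing_table, user_password)
        if has_zk(ctx) and check_zk_node(ctx, ZK_PATH_LISTING_TABLE_INFO):
            delete_zk_node(ctx, ZK_PATH_LISTING_TABLE_INFO)
```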
Entity Relationship Diagram for S3 Object Listing Storage

```mermaid
erDiagram
    listing_table {
        String obj_path PK "S3 Object Path"
        UInt64 obj_size "Object Size (bytes)"
        String storage_policy "ClickHouse storage policy for this table's data"
    }
    note "ClickHouse MergeTree table storing S3 object paths and sizes. Ordered by obj_path." for listing_table
    zk_listing_table_info {
        String replica_owner PK "Hostname of CH replica owning the listing_table"
    }
    note "Data in ZooKeeper node at ZK_PATH_LISTING_TABLE_INFO. Coordinates ownership of the 'listing_table' in a multi-replica setup." for zk_listing_table_info
```
Class Diagram of Changes (diagram omitted)

| Change | Details | Files |
| --- | --- | --- |
| Securely manage listing table lifecycle using Query abstraction | | ch_tools/common/commands/clean_object_storage.py |
| Coordinate listing table ownership across replicas with ZooKeeper | | ch_tools/common/commands/clean_object_storage.py, ch_tools/chadmin/internal/zookeeper.py |
| Automatically select appropriate remote engine based on port | | ch_tools/common/commands/clean_object_storage.py |
| Add placeholder version file | | version.txt |
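The "remote engine based on port" row refers to choosing between remote() and remoteSecure(). A rough sketch of that selection is shown below; _get_remote_clause is the name used later in this PR, but the port lookup is a placeholder (9440 is ClickHouse's default secure TCP port).

```python
# Rough sketch: use remoteSecure() when the client is configured for the TLS port.
def _get_remote_clause(ctx):
    port = get_clickhouse_port(ctx)  # hypothetical helper, not from the PR
    return "remoteSecure" if port == 9440 else "remote"
```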
Force-pushed from d70b867 to ded187e
Hey @NikitaUnisikhin - I've reviewed your changes and found some issues that need to be addressed.
Blocking issues:
- Potential SQL injection from unescaped object paths (link)
Here's what I looked at during the review
- 🟡 General issues: 4 issues found
- 🟢 Security: all looks good
- 🟢 Review instructions: all looks good
- 🟢 Testing: all looks good
- 🟢 Documentation: all looks good
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
```python
listing_table_info = get_zk_node(ctx, ZK_PATH_LISTING_TABLE_INFO)
data = json.loads(listing_table_info)
replica_owner = data["replica_owner"]
listing_table = f"{remote_clause}('{replica_owner}', {listing_table}, '{user_name}', '{{user_password}}')"
```
issue (bug_risk): Incorrect quoting and interpolation in the remote() function call.

Wrap the table name in quotes and use single braces for the password interpolation: `f"{remote_clause}('{replica_owner}', '{listing_table}', '{user_name}', '{user_password}')"`.
In remote_data_paths_table the same thing is done and everything works.
I get why it seems to work in your remote_data_paths_table helper today, but the way you're constructing the SQL here is still going to bite you in two ways:

- You never wrap the fully qualified table name in quotes, so if your database or table names ever contain characters that ClickHouse won't accept in an unquoted identifier (hyphens, dots, and the like), you'll blow up at runtime.
- By using `{{user_password}}` inside an f-string you're literally embedding the braces into the SQL, not your password.

Even if it "works" today, it's fragile and leaves you open to SQL injection between your Python interpolation and the ClickHouse parser. I'd recommend switching to the same pattern used everywhere else (single braces in f-strings and quoted identifiers) and, ideally, wrapping the whole thing in a Query(..., sensitive_args=...) so you never accidentally leak a password into your logs. For example:
```python
# Build a properly quoted identifier for the table.
qualified_table = f"{config['listing_table_database']}.`{config['listing_table_prefix']}{disk_conf.name}`"

# Construct the remote() call with single braces around the password.
listing_table = (
    f"{remote_clause}('{replica_owner}', '{qualified_table}', "
    f"'{user_name}', '{user_password}')"
)

# If you're using our Query wrapper you could do:
listing_table_query = Query(
    listing_table,
    sensitive_args={"user_password": user_password},
)
```
This way you’re guaranteed correct quoting and you never accidentally leave your password in a log or stack trace. Can you give that a try?
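For reference, a quick interpreter check of the brace behaviour discussed above (illustrative only, not taken from the PR):

```python
>>> user_password = "secret"
>>> f"remote('host', db.table, 'user', '{{user_password}}')"
"remote('host', db.table, 'user', '{user_password}')"  # braces kept literally
>>> f"remote('host', db.table, 'user', '{user_password}')"
"remote('host', db.table, 'user', 'secret')"  # value actually interpolated
```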
Force-pushed from 475b5c4 to c26049b
Force-pushed from 736b146 to ebe2678
Force-pushed from ebe2678 to 0ff237c
```python
from ch_tools.monrun_checks.clickhouse_info import ClickhouseInfo


def has_zk(ctx):
```
Let's add type hints for all new code
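For example, the new helper could be annotated roughly like this (the Context type is assumed to be the click context the commands already pass around; the body is elided):

```python
from click import Context


def has_zk(ctx: Context) -> bool:
    """Return True if this installation has a ZooKeeper to coordinate through."""
    ...
```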
```python
DEFAULT_GUARD_INTERVAL = "24h"
KEYS_BATCH_SIZE = 100_000
# The ZK node at this path contains information about the listing table.
LISTING_TABLE_INFO_ZK_PATH = "/object-storage-cleanup-state/listing-table-info"
```
How about moving state_zk_path to the config? If it is present, has_zk will no longer be needed.
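A possible shape of that suggestion, purely illustrative (the config keys and the helper are made up for this sketch):

```python
# Hypothetical config-driven variant: if the path is configured, ZK coordination
# is enabled and no separate has_zk() check is required.
state_zk_path = config.get("object_storage", {}).get("state_zk_path")
if state_zk_path is not None:
    update_listing_table_owner(ctx, state_zk_path)  # hypothetical helper
```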
```python
# The ZK node at this path contains information about the listing table.
LISTING_TABLE_INFO_ZK_PATH = "/object-storage-cleanup-state/listing-table-info"
# Field identifying the replica on which the listing table is located.
LISTING_TABLE_REPLICA_OWNER_FIELD = "replica_owner"
```
Can we extend OrphanedObjectsState?
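One way that could look, assuming OrphanedObjectsState is a simple dataclass persisted as the cleanup state; the existing field names below are guesses, and only replica_owner is the addition being discussed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OrphanedObjectsState:
    orphaned_objects_size: int            # existing field (assumed)
    error_msg: str                        # existing field (assumed)
    replica_owner: Optional[str] = None   # new: host that owns the listing table
```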
```python
listing_table_info = get_zk_node(ctx, LISTING_TABLE_INFO_ZK_PATH)
info = json.loads(listing_table_info)
replica_owner = info[LISTING_TABLE_REPLICA_OWNER_FIELD]
listing_table = f"{_get_remote_clause(ctx)}('{replica_owner}', {listing_table}, '{user_name}', '{{user_password}}')"
```
We should use remote only if we use --use-saved-list. Otherwise we will insert into remote(), and I'm not sure that is a good idea.
I thought the idea was that we use remote during the week for invocations with --use-saved-list, but once a week the replicas should, under a lock, decide who does the full listing and write that into the ZK state. Then all --use-saved-list invocations again use remote to the replica that last made the listing.
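A rough sketch of that weekly hand-off, using kazoo's lock recipe directly; the lock path and helper names are illustrative, not the PR's code:

```python
import json

from kazoo.client import KazooClient


def refresh_listing_owner(zk: KazooClient, hostname: str) -> None:
    # Only one replica at a time gets to do the full S3 listing.
    with zk.Lock("/object-storage-cleanup-state/listing-lock", identifier=hostname):
        do_full_listing()  # hypothetical: re-list S3 into the local listing table
        zk.ensure_path("/object-storage-cleanup-state/listing-table-info")
        zk.set(
            "/object-storage-cleanup-state/listing-table-info",
            json.dumps({"replica_owner": hostname}).encode(),
        )
```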
Summary by Sourcery
Implement persistent listing table management for S3 orphaned objects cleanup using Zookeeper and ClickHouse remote table functions.
New Features:
Enhancements:
Chores: