+
Skip to content

Conversation

NikitaUnisikhin
Copy link
Contributor

@NikitaUnisikhin NikitaUnisikhin commented May 29, 2025

Summary by Sourcery

Implement persistent listing table management for S3 orphaned objects cleanup using Zookeeper and ClickHouse remote table functions.

New Features:

  • Track listing table ownership in Zookeeper under object-storage-cleanup-state/listing-table-info
  • Query the listing table across replicas via ClickHouse remote() or remoteSecure() engine

Enhancements:

  • Propagate ClickHouse client credentials through cleanup and insert functions
  • Wrap create, drop, and insert table queries in Query objects with sensitive_args to handle passwords securely
  • Automatically clean up Zookeeper nodes when dropping the listing table

Chores:

  • Introduce has_zk helper in Zookeeper utils based on replica count
  • Add a version.txt placeholder file

Copy link
Contributor

sourcery-ai bot commented May 29, 2025

Reviewer's Guide

This PR implements a persistent ClickHouse MergeTree table for S3 object listing with secure credential propagation, coordinates table ownership across replicas using ZooKeeper, and dynamically selects remote engine based on port configuration.

Sequence Diagram for Conditional Final Drop of Listing Table with ZooKeeper

sequenceDiagram
    actor UserScript as "User/Script"
    participant CurrentCH as "Current CH Node"
    participant ZooKeeper

    Note over UserScript, CurrentCH: In 'finally' block of clean() function
    opt Not keep_paths (Cleanup requested)
        CurrentCH->>CurrentCH: effective_listing_table // This is the table name string (local or remote) defined earlier in clean()
        CurrentCH->>CurrentCH: Call _drop_table(effective_listing_table, user_password) // May target a remote owner

        alt has_zk() is true AND check_zk_node(ZK_PATH_LISTING_TABLE_INFO) is true
            CurrentCH->>ZooKeeper: delete_zk_node(ZK_PATH_LISTING_TABLE_INFO)
        end
    end
Loading

Entity Relationship Diagram for S3 Object Listing Storage

erDiagram
    listing_table {
        String obj_path PK "S3 Object Path"
        UInt64 obj_size "Object Size (bytes)"
        String storage_policy "ClickHouse storage policy for this table's data"
    }
    note "ClickHouse MergeTree table storing S3 object paths and sizes. Ordered by obj_path." for listing_table

    zk_listing_table_info {
        String replica_owner PK "Hostname of CH replica owning the listing_table"
    }
    note "Data in ZooKeeper node at ZK_PATH_LISTING_TABLE_INFO. Coordinates ownership of the 'listing_table' in a multi-replica setup." for zk_listing_table_info
Loading

Class Diagram of Changes in clean_object_storage.py and Related Components

classDiagram
    class clean_object_storage_module {
      <<Python Module>>
      +ZK_PATH_LISTING_TABLE_INFO : string
      +clean(ctx, use_saved_list, keep_paths, ...)
      -_clean_object_storage(ctx, ch_client, user_name, user_password, ...)
      -_traverse_object_storage(ctx, listing_table, user_password, ...)
      -_insert_listing_batch(ctx, obj_paths_batch, listing_table, user_password)
      -_drop_table(ctx, table_name, user_password)
    }
    note for clean_object_storage_module "Functions updated to use Query objects for secure credential handling and ZooKeeper for multi-replica coordination."

    class Query {
      <<Value Object>>
      +statement : string
      +sensitive_args : map
      +Query(statement: string, sensitive_args: map)
    }
    note for Query "New abstraction for SQL queries, separating sensitive arguments like passwords."

    class ch_tools.common.clickhouse.client {
      <<Utility Module>>
      +execute_query(ctx, query_or_string, sensitive_args, format_): any
    }
    note for ch_tools.common.clickhouse.client "execute_query is used to run SQL, now handles Query objects."

    class ch_tools.chadmin.internal.zookeeper {
      <<Utility Module>>
      +has_zk(ctx) : bool
      +check_zk_node(ctx, path) : bool
      +create_zk_nodes(ctx, paths, make_parents)
      +update_zk_nodes(ctx, path, data_bytes)
      +get_zk_node(ctx, path) : bytes
      +delete_zk_node(ctx, path)
    }
    note for ch_tools.chadmin.internal.zookeeper "Provides ZooKeeper interaction helpers, including new has_zk function."

    class ch_tools.common.clickhouse.client.ClickhouseClient {
        +user: string
        +password: string
        +host: string
        +check_port(ClickhousePort): bool
    }
    note for ch_tools.common.clickhouse.client.ClickhouseClient "Used to fetch credentials, host, and check port for remoteSecure selection."

    clean_object_storage_module ..> Query : creates >
    clean_object_storage_module ..> ch_tools.common.clickhouse.client : uses execute_query >
    clean_object_storage_module ..> ch_tools.chadmin.internal.zookeeper : uses ZK functions >
    clean_object_storage_module ..> ch_tools.common.clickhouse.client.ClickhouseClient : uses >
Loading

File-Level Changes

Change Details Files
Securely manage listing table lifecycle using Query abstraction
  • Wrap CREATE, INSERT, and DROP statements in Query with sensitive_args
  • Propagate user_password into create, insert, and drop functions
  • Update _drop_table signature to accept user_password
ch_tools/common/commands/clean_object_storage.py
Coordinate listing table ownership across replicas with ZooKeeper
  • Define ZK_PATH_LISTING_TABLE_INFO constant
  • Detect multi-replica clusters and create/check zk node
  • Store and retrieve replica_owner in ZooKeeper
  • Delete zk node when dropping the listing table
ch_tools/common/commands/clean_object_storage.py
ch_tools/chadmin/internal/zookeeper.py
Automatically select appropriate remote engine based on port
  • Check for secure or non-secure TCP port to set remote_clause
  • Raise error if neither port is defined
  • Use remote_clause when referencing listing_table in multi-replica mode
ch_tools/common/commands/clean_object_storage.py
Add placeholder version file
  • Introduce version.txt in repo root
version.txt

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

@NikitaUnisikhin NikitaUnisikhin force-pushed the orp_objects_listing_table_fix branch from d70b867 to ded187e Compare May 29, 2025 09:47
Copy link
Contributor

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @NikitaUnisikhin - I've reviewed your changes and found some issues that need to be addressed.

Blocking issues:

  • Potential SQL injection from unescaped object paths (link)
Here's what I looked at during the review
  • 🟡 General issues: 4 issues found
  • 🟢 Security: all looks good
  • 🟢 Review instructions: all looks good
  • 🟢 Testing: all looks good
  • 🟢 Documentation: all looks good

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

listing_table_info = get_zk_node(ctx, ZK_PATH_LISTING_TABLE_INFO)
data = json.loads(listing_table_info)
replica_owner = data["replica_owner"]
listing_table = f"{remote_clause}('{replica_owner}', {listing_table}, '{user_name}', '{{user_password}}')"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (bug_risk): Incorrect quoting and interpolation in remote() function call

Wrap the table name in quotes and use single braces for password interpolation: f"{remote_clause}('replica_owner', '{listing_table}', '{user_name}', '{user_password}')".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in remote_data_paths_table the same is done and everything works

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get why it seems to work in your remote_data_paths_table helper today, but the way you’re constructing the SQL here is still going to bite you in two ways:

  1. You never wrap the fully‐qualified table name in quotes, so if your database or table names ever contain letters that ClickHouse treats as identifiers (hyphens, dots, uppercase letters, etc.) you’ll blow up at runtime.
  2. By using {{user_password}} inside an f-string you’re literally embedding the braces into the SQL, not your password.

Even if it “works” today, it’s fragile and leaves you open to SQL-injection between your Python interpolation and the ClickHouse parser. I’d recommend you switch to the same pattern we use everywhere else—single braces for f-strings and quoted identifiers—and, ideally, wrap the whole thing in a Query(…, sensitive_args=…) so you never accidentally leak a password into your logs. For example:

# build a properly quoted identifier for the table
qualified_table = f"{config['listing_table_database']}.`{config['listing_table_prefix']}{disk_conf.name}`"

# then construct the remote() call with single braces around the password
listing_table = (
    f"{remote_clause}('{replica_owner}', '{qualified_table}', "
    f"'{user_name}', '{user_password}')"
)

# if you’re using our Query wrapper you could do:
listing_table_query = Query(
    listing_table,
    sensitive_args={"user_password": user_password}
)

This way you’re guaranteed correct quoting and you never accidentally leave your password in a log or stack trace. Can you give that a try?

@NikitaUnisikhin NikitaUnisikhin force-pushed the orp_objects_listing_table_fix branch 3 times, most recently from 475b5c4 to c26049b Compare May 29, 2025 12:00
@NikitaUnisikhin NikitaUnisikhin force-pushed the orp_objects_listing_table_fix branch 3 times, most recently from 736b146 to ebe2678 Compare May 29, 2025 13:21
@NikitaUnisikhin NikitaUnisikhin force-pushed the orp_objects_listing_table_fix branch from ebe2678 to 0ff237c Compare May 29, 2025 13:32
@NikitaUnisikhin NikitaUnisikhin requested a review from aalexfvk May 29, 2025 13:42
@NikitaUnisikhin NikitaUnisikhin changed the title [WIP] Actual listing table for S3 orphaned objects monitoring Actual listing table for S3 orphaned objects monitoring May 29, 2025
from ch_tools.monrun_checks.clickhouse_info import ClickhouseInfo


def has_zk(ctx):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add type hints for all new code

DEFAULT_GUARD_INTERVAL = "24h"
KEYS_BATCH_SIZE = 100_000
# The ZK node along this path contains information about listing table.
LISTING_TABLE_INFO_ZK_PATH = "/object-storage-cleanup-state/listing-table-info"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about moving state_zk_path to the config? if it is present has_zk will not needed anymore.

# The ZK node along this path contains information about listing table.
LISTING_TABLE_INFO_ZK_PATH = "/object-storage-cleanup-state/listing-table-info"
# Field of the replica on which the listing table is located.
LISTING_TABLE_REPLICA_OWNER_FIELD = "replica_owner"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we extend OrphanedObjectsState ?

listing_table_info = get_zk_node(ctx, LISTING_TABLE_INFO_ZK_PATH)
info = json.loads(listing_table_info)
replica_owner = info[LISTING_TABLE_REPLICA_OWNER_FIELD]
listing_table = f"{_get_remote_clause(ctx)}('{replica_owner}', {listing_table}, '{user_name}', '{{user_password}}')"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should use remote only if we use --use-saved-list. Otherwise we will insert in remote() I'm not sure it is a good idea.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought that idea was we use remote during the week for invokings with --use-saved-list, but once a week, replicas should under lock decide who does full listing and write this into zk-state. And again all --use-saved-list invokings use remote to replicas that last made the listing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants
点击 这是indexloc提供的php浏览器服务,不要输入任何密码和下载