+
Skip to content

Conversation

kirillgarbar
Copy link
Contributor

@kirillgarbar kirillgarbar commented Oct 3, 2025

Summary by Sourcery

Save object creation times and enhance object-storage cleaning to support time-based filtering and date-partitioned statistics.

New Features:

  • Capture and store object last_modified timestamps in listing and orphaned objects tables.
  • Allow cleanup of orphaned objects with time range filtering and reuse of saved lists via --use-saved-list.
  • Introduce --stat-partitioning option to partition cleanup statistics by day, month, or show only total.
  • Update chadmin object-storage clean to output YAML stats keyed by partition dates.

Enhancements:

  • Refactor cleanup logic to aggregate deletion stats in a ResultStatDict instead of simple tuples.
  • Add StatisticsPartitioning enum with helper functions for per-partition stats aggregation.
  • Update object list generator and SQL schemas to include last_modified filtering and ordering.
  • Improve templating by trimming and stripping whitespace in rendered queries.

Tests:

  • Add feature tests for use-saved-list option and stat-partitioning scenarios.
  • Normalize existing object storage feature tests indentation.

Copy link
Contributor

sourcery-ai bot commented Oct 3, 2025

Reviewer's Guide

This PR augments the object-storage clean/collect-info workflow by persisting and filtering on object creation timestamps, refactors clean to return structured statistics, and enables partitioned output stats by date across CLI, backend logic and tests.

ER diagram for updated object listing table schema

erDiagram
    OBJECT_LISTING_TABLE {
        DateTime last_modified
        String obj_path
        UInt64 obj_size
    }
    OBJECT_LISTING_TABLE ||--o| ORPHANED_OBJECTS_TABLE : "obj_path"
    ORPHANED_OBJECTS_TABLE {
        DateTime last_modified
        String obj_path
        UInt64 obj_size
    }
Loading

Class diagram for updated ObjListItem and statistics partitioning

classDiagram
    class ObjListItem {
        +datetime time
        +str path
        +int size
        +from_tab_separated(value: str) ObjListItem
    }
    class StatisticsPartitioning {
        <<enum>>
        +DAY
        +MONTH
        +ALL
    }
    class ResultStatDict {
        <<dict[str, dict[str, int]>>
    }
    ObjListItem --> StatisticsPartitioning : used for partitioning stats
    StatisticsPartitioning --> ResultStatDict : determines keys
Loading

File-Level Changes

Change Details Files
Include last_modified timestamp in object listings and tables
  • Select last_modified in orphaned-objects and listing queries
  • Extend listing table schema and INSERT to include last_modified
  • Adapt ObjListItem to parse and carry a time field
ch_tools/common/commands/object_storage.py
ch_tools/chadmin/internal/object_storage/obj_list_item.py
Apply time-based filters to object list generator
  • Extend _object_list_generator to accept from_time/to_time
  • Inject Jinja conditions for last_modified bounds in SELECT
ch_tools/common/commands/object_storage.py
Add statistics partitioning to clean command
  • Define StatisticsPartitioning enum and result_stat dict
  • Update cleanup_s3_object_storage and _update_stat to accumulate per-partition stats
  • Introduce --stat-partitioning option in CLI and propagate to clean
ch_tools/chadmin/internal/object_storage/s3_cleanup.py
ch_tools/chadmin/cli/object_storage_group.py
Refactor clean functions to return structured stats
  • Change clean() and _clean_object_storage() signatures to return ResultStatDict
  • Adjust callers and print_response to handle result_stat instead of tuple
ch_tools/common/commands/object_storage.py
ch_tools/chadmin/cli/object_storage_group.py
Update tests for saved-list and stat-partitioning scenarios
  • Add Gherkin scenarios for use-saved-list and stat-partitioning
  • Fix indentation of WouldDelete/Deleted lines in existing scenarios
tests/features/object_storage.feature
Standardize timestamp formatting and Jinja templating
  • Introduce DATETIME_FORMAT constant for consistent parsing/formatting
  • Use DATETIME_FORMAT in format_timestamp and parsing utilities
  • Enable trim_blocks and lstrip_blocks in Jinja Environment
ch_tools/chadmin/internal/utils.py
ch_tools/common/cli/formatting.py
ch_tools/common/clickhouse/client/clickhouse_client.py

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

Copy link
Contributor

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey there - I've reviewed your changes - here's some feedback:

Blocking issues:

  • Detected direct use of jinja2. If not done properly, this may bypass HTML escaping which opens up the application to cross-site scripting (XSS) vulnerabilities. Prefer using the Flask method 'render_template()' and templates with a '.html' extension in order to prevent XSS. (link)
  • Detected a Jinja2 environment without autoescaping. Jinja2 does not autoescape by default. This is dangerous if you are rendering to a browser because this allows for cross-site scripting (XSS) attacks. If you are in a web context, enable autoescaping by setting 'autoescape=True.' You may also consider using 'jinja2.select_autoescape()' to only enable automatic escaping for certain file extensions. (link)

General comments:

  • The listing table schema upgrade introduces a new last_modified column and changes the ORDER BY clause, so consider adding a migration step or version guard to avoid breaking existing tables.
  • You’re calling datetime.now() directly in _object_list_generator, which can make tests non-deterministic—consider injecting the current time or passing it in as a parameter for easier testing.
  • The --stat-partitioning flag is parsed as a string but clean() expects a StatisticsPartitioning enum, so add an explicit conversion or use an enum-based click type to ensure correct values are passed through.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- The listing table schema upgrade introduces a new last_modified column and changes the ORDER BY clause, so consider adding a migration step or version guard to avoid breaking existing tables.
- You’re calling datetime.now() directly in _object_list_generator, which can make tests non-deterministic—consider injecting the current time or passing it in as a parameter for easier testing.
- The --stat-partitioning flag is parsed as a string but clean() expects a StatisticsPartitioning enum, so add an explicit conversion or use an enum-based click type to ensure correct values are passed through.

## Individual Comments

### Comment 1
<location> `ch_tools/chadmin/internal/object_storage/obj_list_item.py:13` </location>
<code_context>
     Item of object storage listing.
     """

+    time: datetime
     path: str
     size: int
</code_context>

<issue_to_address>
**suggestion:** Consider using more descriptive attribute names for clarity.

Renaming 'time' to 'last_modified' would align with existing terminology and enhance consistency across the codebase and schema.

Suggested implementation:

```python
    last_modified: datetime
    path: str
    size: int

```

```python
        time_str, path, size = value.split("\t")
        last_modified = datetime.strptime(time_str, DATETIME_FORMAT)
        return cls(last_modified, path, int(size))

```
</issue_to_address>

### Comment 2
<location> `ch_tools/chadmin/internal/object_storage/s3_cleanup.py:66-69` </location>
<code_context>
+    stat["Total"]["deleted"] += 1
+    stat["Total"]["total_size"] += item.size
+
+    key = _get_stat_key_by_partitioning(
+        stat_partitioning, item.time.strftime(DATETIME_FORMAT)
+    )
+    if key != "Total":
+        _init_key_in_stat(stat, key)
+        stat[key]["deleted"] += 1
</code_context>

<issue_to_address>
**issue:** Partitioning logic may not handle unexpected date formats robustly.

The current implementation relies on a fixed date format, which may cause issues if the format varies. Please add input validation or use datetime parsing to ensure consistent partitioning.
</issue_to_address>

### Comment 3
<location> `ch_tools/common/clickhouse/client/clickhouse_client.py:265` </location>
<code_context>
        env = Environment(trim_blocks=True, lstrip_blocks=True)
</code_context>

<issue_to_address>
**security (python.flask.security.xss.audit.direct-use-of-jinja2):** Detected direct use of jinja2. If not done properly, this may bypass HTML escaping which opens up the application to cross-site scripting (XSS) vulnerabilities. Prefer using the Flask method 'render_template()' and templates with a '.html' extension in order to prevent XSS.

*Source: opengrep*
</issue_to_address>

### Comment 4
<location> `ch_tools/common/clickhouse/client/clickhouse_client.py:265` </location>
<code_context>
        env = Environment(trim_blocks=True, lstrip_blocks=True)
</code_context>

<issue_to_address>
**security (python.jinja2.security.audit.missing-autoescape-disabled):** Detected a Jinja2 environment without autoescaping. Jinja2 does not autoescape by default. This is dangerous if you are rendering to a browser because this allows for cross-site scripting (XSS) attacks. If you are in a web context, enable autoescaping by setting 'autoescape=True.' You may also consider using 'jinja2.select_autoescape()' to only enable automatic escaping for certain file extensions.

```suggestion
        env = Environment(trim_blocks=True, lstrip_blocks=True, autoescape=True)
```

*Source: opengrep*
</issue_to_address>

### Comment 5
<location> `ch_tools/chadmin/internal/object_storage/s3_cleanup.py:90-92` </location>
<code_context>
def _init_key_in_stat(stat: ResultStatDict, key: str) -> None:
    if key not in stat:
        stat[key] = {}
        stat[key]["deleted"] = 0
        stat[key]["total_size"] = 0

</code_context>

<issue_to_address>
**suggestion (code-quality):** Merge dictionary assignment with declaration [×2] ([`merge-dict-assign`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/merge-dict-assign/))

```suggestion
        stat[key] = {"deleted": 0, "total_size": 0}
```
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Item of object storage listing.
"""

time: datetime
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: Consider using more descriptive attribute names for clarity.

Renaming 'time' to 'last_modified' would align with existing terminology and enhance consistency across the codebase and schema.

Suggested implementation:

    last_modified: datetime
    path: str
    size: int
        time_str, path, size = value.split("\t")
        last_modified = datetime.strptime(time_str, DATETIME_FORMAT)
        return cls(last_modified, path, int(size))

Comment on lines +66 to +69
key = _get_stat_key_by_partitioning(
stat_partitioning, item.time.strftime(DATETIME_FORMAT)
)
if key != "Total":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue: Partitioning logic may not handle unexpected date formats robustly.

The current implementation relies on a fixed date format, which may cause issues if the format varies. Please add input validation or use datetime parsing to ensure consistent partitioning.


def render_query(self, query: str, **kwargs: Any) -> str:
env = Environment()
env = Environment(trim_blocks=True, lstrip_blocks=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security (python.flask.security.xss.audit.direct-use-of-jinja2): Detected direct use of jinja2. If not done properly, this may bypass HTML escaping which opens up the application to cross-site scripting (XSS) vulnerabilities. Prefer using the Flask method 'render_template()' and templates with a '.html' extension in order to prevent XSS.

Source: opengrep


def render_query(self, query: str, **kwargs: Any) -> str:
env = Environment()
env = Environment(trim_blocks=True, lstrip_blocks=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security (python.jinja2.security.audit.missing-autoescape-disabled): Detected a Jinja2 environment without autoescaping. Jinja2 does not autoescape by default. This is dangerous if you are rendering to a browser because this allows for cross-site scripting (XSS) attacks. If you are in a web context, enable autoescaping by setting 'autoescape=True.' You may also consider using 'jinja2.select_autoescape()' to only enable automatic escaping for certain file extensions.

Suggested change
env = Environment(trim_blocks=True, lstrip_blocks=True)
env = Environment(trim_blocks=True, lstrip_blocks=True, autoescape=True)

Source: opengrep

Comment on lines +90 to +92
stat[key] = {}
stat[key]["deleted"] = 0
stat[key]["total_size"] = 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (code-quality): Merge dictionary assignment with declaration [×2] (merge-dict-assign)

Suggested change
stat[key] = {}
stat[key]["deleted"] = 0
stat[key]["total_size"] = 0
stat[key] = {"deleted": 0, "total_size": 0}

Comment on lines +175 to +181
@option(
"--stat-partitioning",
"stat_partitioning",
type=Choice(["all", "month", "day"]),
default=StatisticsPartitioning.ALL,
help=("Partition output stats by months or days, or return only total stats."),
)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better name offers open

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant
点击 这是indexloc提供的php浏览器服务,不要输入任何密码和下载