-
Notifications
You must be signed in to change notification settings - Fork 10
Save object create time and allow to partition output stats by date #404
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Reviewer's GuideThis PR augments the object-storage clean/collect-info workflow by persisting and filtering on object creation timestamps, refactors clean to return structured statistics, and enables partitioned output stats by date across CLI, backend logic and tests. ER diagram for updated object listing table schemaerDiagram
OBJECT_LISTING_TABLE {
DateTime last_modified
String obj_path
UInt64 obj_size
}
OBJECT_LISTING_TABLE ||--o| ORPHANED_OBJECTS_TABLE : "obj_path"
ORPHANED_OBJECTS_TABLE {
DateTime last_modified
String obj_path
UInt64 obj_size
}
Class diagram for updated ObjListItem and statistics partitioningclassDiagram
class ObjListItem {
+datetime time
+str path
+int size
+from_tab_separated(value: str) ObjListItem
}
class StatisticsPartitioning {
<<enum>>
+DAY
+MONTH
+ALL
}
class ResultStatDict {
<<dict[str, dict[str, int]>>
}
ObjListItem --> StatisticsPartitioning : used for partitioning stats
StatisticsPartitioning --> ResultStatDict : determines keys
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey there - I've reviewed your changes - here's some feedback:
Blocking issues:
- Detected direct use of jinja2. If not done properly, this may bypass HTML escaping which opens up the application to cross-site scripting (XSS) vulnerabilities. Prefer using the Flask method 'render_template()' and templates with a '.html' extension in order to prevent XSS. (link)
- Detected a Jinja2 environment without autoescaping. Jinja2 does not autoescape by default. This is dangerous if you are rendering to a browser because this allows for cross-site scripting (XSS) attacks. If you are in a web context, enable autoescaping by setting 'autoescape=True.' You may also consider using 'jinja2.select_autoescape()' to only enable automatic escaping for certain file extensions. (link)
General comments:
- The listing table schema upgrade introduces a new last_modified column and changes the ORDER BY clause, so consider adding a migration step or version guard to avoid breaking existing tables.
- You’re calling datetime.now() directly in _object_list_generator, which can make tests non-deterministic—consider injecting the current time or passing it in as a parameter for easier testing.
- The --stat-partitioning flag is parsed as a string but clean() expects a StatisticsPartitioning enum, so add an explicit conversion or use an enum-based click type to ensure correct values are passed through.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The listing table schema upgrade introduces a new last_modified column and changes the ORDER BY clause, so consider adding a migration step or version guard to avoid breaking existing tables.
- You’re calling datetime.now() directly in _object_list_generator, which can make tests non-deterministic—consider injecting the current time or passing it in as a parameter for easier testing.
- The --stat-partitioning flag is parsed as a string but clean() expects a StatisticsPartitioning enum, so add an explicit conversion or use an enum-based click type to ensure correct values are passed through.
## Individual Comments
### Comment 1
<location> `ch_tools/chadmin/internal/object_storage/obj_list_item.py:13` </location>
<code_context>
Item of object storage listing.
"""
+ time: datetime
path: str
size: int
</code_context>
<issue_to_address>
**suggestion:** Consider using more descriptive attribute names for clarity.
Renaming 'time' to 'last_modified' would align with existing terminology and enhance consistency across the codebase and schema.
Suggested implementation:
```python
last_modified: datetime
path: str
size: int
```
```python
time_str, path, size = value.split("\t")
last_modified = datetime.strptime(time_str, DATETIME_FORMAT)
return cls(last_modified, path, int(size))
```
</issue_to_address>
### Comment 2
<location> `ch_tools/chadmin/internal/object_storage/s3_cleanup.py:66-69` </location>
<code_context>
+ stat["Total"]["deleted"] += 1
+ stat["Total"]["total_size"] += item.size
+
+ key = _get_stat_key_by_partitioning(
+ stat_partitioning, item.time.strftime(DATETIME_FORMAT)
+ )
+ if key != "Total":
+ _init_key_in_stat(stat, key)
+ stat[key]["deleted"] += 1
</code_context>
<issue_to_address>
**issue:** Partitioning logic may not handle unexpected date formats robustly.
The current implementation relies on a fixed date format, which may cause issues if the format varies. Please add input validation or use datetime parsing to ensure consistent partitioning.
</issue_to_address>
### Comment 3
<location> `ch_tools/common/clickhouse/client/clickhouse_client.py:265` </location>
<code_context>
env = Environment(trim_blocks=True, lstrip_blocks=True)
</code_context>
<issue_to_address>
**security (python.flask.security.xss.audit.direct-use-of-jinja2):** Detected direct use of jinja2. If not done properly, this may bypass HTML escaping which opens up the application to cross-site scripting (XSS) vulnerabilities. Prefer using the Flask method 'render_template()' and templates with a '.html' extension in order to prevent XSS.
*Source: opengrep*
</issue_to_address>
### Comment 4
<location> `ch_tools/common/clickhouse/client/clickhouse_client.py:265` </location>
<code_context>
env = Environment(trim_blocks=True, lstrip_blocks=True)
</code_context>
<issue_to_address>
**security (python.jinja2.security.audit.missing-autoescape-disabled):** Detected a Jinja2 environment without autoescaping. Jinja2 does not autoescape by default. This is dangerous if you are rendering to a browser because this allows for cross-site scripting (XSS) attacks. If you are in a web context, enable autoescaping by setting 'autoescape=True.' You may also consider using 'jinja2.select_autoescape()' to only enable automatic escaping for certain file extensions.
```suggestion
env = Environment(trim_blocks=True, lstrip_blocks=True, autoescape=True)
```
*Source: opengrep*
</issue_to_address>
### Comment 5
<location> `ch_tools/chadmin/internal/object_storage/s3_cleanup.py:90-92` </location>
<code_context>
def _init_key_in_stat(stat: ResultStatDict, key: str) -> None:
if key not in stat:
stat[key] = {}
stat[key]["deleted"] = 0
stat[key]["total_size"] = 0
</code_context>
<issue_to_address>
**suggestion (code-quality):** Merge dictionary assignment with declaration [×2] ([`merge-dict-assign`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/merge-dict-assign/))
```suggestion
stat[key] = {"deleted": 0, "total_size": 0}
```
</issue_to_address>
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
Item of object storage listing. | ||
""" | ||
|
||
time: datetime |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: Consider using more descriptive attribute names for clarity.
Renaming 'time' to 'last_modified' would align with existing terminology and enhance consistency across the codebase and schema.
Suggested implementation:
last_modified: datetime
path: str
size: int
time_str, path, size = value.split("\t")
last_modified = datetime.strptime(time_str, DATETIME_FORMAT)
return cls(last_modified, path, int(size))
key = _get_stat_key_by_partitioning( | ||
stat_partitioning, item.time.strftime(DATETIME_FORMAT) | ||
) | ||
if key != "Total": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue: Partitioning logic may not handle unexpected date formats robustly.
The current implementation relies on a fixed date format, which may cause issues if the format varies. Please add input validation or use datetime parsing to ensure consistent partitioning.
|
||
def render_query(self, query: str, **kwargs: Any) -> str: | ||
env = Environment() | ||
env = Environment(trim_blocks=True, lstrip_blocks=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
security (python.flask.security.xss.audit.direct-use-of-jinja2): Detected direct use of jinja2. If not done properly, this may bypass HTML escaping which opens up the application to cross-site scripting (XSS) vulnerabilities. Prefer using the Flask method 'render_template()' and templates with a '.html' extension in order to prevent XSS.
Source: opengrep
|
||
def render_query(self, query: str, **kwargs: Any) -> str: | ||
env = Environment() | ||
env = Environment(trim_blocks=True, lstrip_blocks=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
security (python.jinja2.security.audit.missing-autoescape-disabled): Detected a Jinja2 environment without autoescaping. Jinja2 does not autoescape by default. This is dangerous if you are rendering to a browser because this allows for cross-site scripting (XSS) attacks. If you are in a web context, enable autoescaping by setting 'autoescape=True.' You may also consider using 'jinja2.select_autoescape()' to only enable automatic escaping for certain file extensions.
env = Environment(trim_blocks=True, lstrip_blocks=True) | |
env = Environment(trim_blocks=True, lstrip_blocks=True, autoescape=True) |
Source: opengrep
stat[key] = {} | ||
stat[key]["deleted"] = 0 | ||
stat[key]["total_size"] = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion (code-quality): Merge dictionary assignment with declaration [×2] (merge-dict-assign
)
stat[key] = {} | |
stat[key]["deleted"] = 0 | |
stat[key]["total_size"] = 0 | |
stat[key] = {"deleted": 0, "total_size": 0} |
@option( | ||
"--stat-partitioning", | ||
"stat_partitioning", | ||
type=Choice(["all", "month", "day"]), | ||
default=StatisticsPartitioning.ALL, | ||
help=("Partition output stats by months or days, or return only total stats."), | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better name offers open
Summary by Sourcery
Save object creation times and enhance object-storage cleaning to support time-based filtering and date-partitioned statistics.
New Features:
Enhancements:
Tests: