Add Wayback Machine URL archiver and replacer script #2504
This commit introduces a new CLI script that:
- Recursively scans the `holidays` package for `.py` and `.po` files, ignoring `__pycache__`
- Extracts all HTTP(S) links, filtering out known domains (e.g., GitHub, Wikipedia, Python docs)
- Queries the Wayback Machine CDX API for existing captures
- Submits URLs to the Wayback Machine Save API when needed
- Supports three archive policies:
  • `if-missing` (default): archive only when no capture exists
  • `always`: always submit for archiving, even if captures exist
  • `never`: only look up existing captures, do not archive
- Replaces original URLs in source files with their Wayback snapshots
- Uses a retrying `requests` session with exponential backoff for robustness
- Prints progress summaries and warning messages for files that cannot be read or written

This script makes it easy to freeze external references within the codebase, ensuring all links remain valid over time.
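The scanning and link-extraction steps described above could look roughly like the sketch below. It is not the script from this PR: the names `IGNORED_DOMAINS`, `URL_PATTERN`, `is_ignored`, `extract_urls`, and `iter_source_files` are hypothetical, the ignore list is only a guess based on the examples in the commit message, and the regular expression is a deliberately simple approximation.

```python
import re
from pathlib import Path
from urllib.parse import urlparse

# Hypothetical ignore list; the actual list used by the script is not shown here.
IGNORED_DOMAINS = ("docs.python.org", "github.com", "wikipedia.org")

# Simplified pattern: stop at whitespace, quotes, angle brackets, ")" and "]".
URL_PATTERN = re.compile(r"https?://[^\s\"'<>)\]]+")


def is_ignored(url: str) -> bool:
    """Return True if the URL's host is an ignored domain or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in IGNORED_DOMAINS)


def extract_urls(text: str) -> list[str]:
    """Return unique, non-ignored URLs in order of first appearance."""
    # Strip trailing sentence punctuation that the pattern may have swallowed.
    found = dict.fromkeys(m.group(0).rstrip(".,;") for m in URL_PATTERN.finditer(text))
    return [url for url in found if not is_ignored(url)]


def iter_source_files(root: Path):
    """Yield .py and .po files under root, skipping anything inside __pycache__."""
    for path in sorted(root.rglob("*")):
        if (
            path.is_file()
            and path.suffix in (".py", ".po")
            and "__pycache__" not in path.parts
        ):
            yield path
```

A caller would typically loop over `iter_source_files(Path("holidays"))`, read each file, and pass its text to `extract_urls`.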
ebeae2b to 383be6c
There are still 41 files with ~177 URLs that could not be archived by the Internet Archive. Some of these URLs might no longer be accessible.
I've checked up to letter N for now
Co-authored-by: Panpakorn Siripanich <19505219+PPsyrius@users.noreply.github.com> Signed-off-by: Kriti Birda <164247895+kritibirda26@users.noreply.github.com>
Signed-off-by: Kriti Birda <164247895+kritibirda26@users.noreply.github.com>
Remove tiny.cc source link aliases, Thailand sources archive work
I did look into SonarQube's error suppression a bit. It seems they only have `# NOSONAR` as an in-line, global issue suppression tool, with no option to disable only a specific rule like mypy or other tools do...
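To illustrate the difference being described, here is a small sketch. The functions `jitter` and `to_label` are made up for the example and are not from this PR. A `# NOSONAR` comment suppresses whatever SonarQube reports on that line, while mypy lets a suppression name a single error code.

```python
import random


def jitter() -> float:
    # NOSONAR silences every SonarQube issue raised on this line,
    # whichever rule triggered it.
    return random.random()  # NOSONAR


def to_label(value: int) -> str:
    return str(value)


# mypy can scope the suppression to one error code (here: arg-type),
# so any other error on the same line would still be reported.
label = to_label("2024")  # type: ignore[arg-type]
```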
LGTM!
Sure, I'll review it this week 👍
Co-authored-by: Panpakorn Siripanich <19505219+PPsyrius@users.noreply.github.com> Signed-off-by: Kriti Birda <164247895+kritibirda26@users.noreply.github.com>
LGTM! 👍
@kritibirda26 great work -- both idea and implementation 👏
I don't want to be a blocker here just because of readability and best-practice suggestions. Moreover, refactoring w/o tests is a bit tricky. So I'll just provide some general feedback you might use later:
- if you need just a sequence prefer using tuples instead of lists
- use consistent naming (ignore vs ignored)
- use Path instead of os.path
- if dict (or other) params order doesn't matter -- order alphabetically
- use spellcheck locally (e.g. your IDE plugin)
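A few of the suggestions above, sketched in code. All names and values here (`IGNORED_DIRECTORIES`, `IGNORED_EXTENSIONS`, `DEFAULT_OPTIONS`, `relative_name`, the timeout) are hypothetical and only illustrate the style points; they are not taken from the script under review.

```python
from pathlib import Path

# Tuples rather than lists: fixed sequences that are never mutated.
# Both constants share the "IGNORED_" prefix instead of mixing "ignore" and "ignored".
IGNORED_DIRECTORIES = ("__pycache__",)
IGNORED_EXTENSIONS = (".pyc",)

# Key order has no meaning here, so the keys are sorted alphabetically.
DEFAULT_OPTIONS = {
    "archive_policy": "if-missing",
    "root": "holidays",
    "timeout": 30,
}


def relative_name(path: Path, root: Path) -> str:
    """Return path relative to root, using Path methods instead of os.path helpers."""
    return path.relative_to(root).as_posix()
```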
Signed-off-by: Arkadii Yakovets <2201626+arkid15r@users.noreply.github.com>
Proposed change
Add Wayback Machine URL archiver and replacer script
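This PR introduces a new CLI script that:
- Recursively scans the `holidays` package for `.py` and `.po` files, ignoring `__pycache__`
- Extracts all HTTP(S) links, filtering out known domains (e.g., GitHub, Wikipedia, Python docs)
- Queries the Wayback Machine CDX API for existing captures
- Submits URLs to the Wayback Machine Save API when needed
- Supports three archive policies:
  • `if-missing` (default): archive only when no capture exists
  • `always`: always submit for archiving, even if captures exist
  • `never`: only look up existing captures, do not archive
- Replaces original URLs in source files with their Wayback snapshots
- Uses a retrying `requests` session with exponential backoff for robustness
- Prints progress summaries and warning messages for files that cannot be read or written

This script makes it easy to freeze external references within the codebase, ensuring all links remain valid over time.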
Fix #2467.
Type of change
`holidays` functionality in general)
Checklist
`make check`, all checks and tests are green