Description
Let's use the data processing libraries maintained in this repository to improve the official WordPress WXR importer.
Goals
- Short-term: Improve the WXR importing experience in a succession of small, frequent, meaningful changes to the existing wordpress-importer plugin.
- Long-term: Ship a selective, fast, live WordPress <-> WordPress sync.
- Anti-goal: Get stuck for two months on rewriting the entire existing importer plugin from scratch before shipping the first user-facing change.
The current state of the WordPress importer
I've mapped out some features and shortcomings of the existing WordPress importer. The proposed roadmap below is based on this research:
Draft roadmap – let's discuss
1. Testability
The wordpress-importer implementation is a mixture of wp-admin UI rendering, input processing, and importing logic.
The https://github.com/WordPress/wordpress-importer repository has some unit tests. Let's investigate how we can leverage and expand that test harness to make changes without breaking existing use cases.
2. Compatibility with all hosts
The wordpress-importer plugin relies on libxml and won't work on some hosts. That's quite restrictive – let's make it compatible with all the hosts!
- Bump version to 0.9.0 wordpress-importer#200
- Use WXR_Parser_XML_Processor as the fallback instead of WXR_Parser_Regex wordpress-importer#199
- Run E2E tests for every parser, including WXR_Parser_XML_Processor wordpress-importer#197
- Add E2E tests for importing WXR files wordpress-importer#195
- WXR_Parser_XML_Processor: An all-PHP WXR parser wordpress-importer#190
3. URL rewriting
wordpress-importer does not rewrite absolute URLs in the imported posts and comments. As a result, the imported data often contains broken links to a source site.
- Expose hooks for plugin authors to rewrite URLs in their content, e.g. custom post types, custom tables, base64-encoded fragments of regular posts and pages.
- Refresh the public suffix list periodically
- Enable or disable URL rewriting with a checkbox wordpress-importer#207
- Feature: URL rewriting in the imported content wordpress-importer#202
- Bump minimum PHP version to 7.2 wordpress-importer#196
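To make the idea concrete, here is a minimal, language-agnostic sketch of URL rewriting, written in Python for brevity. It is not the code from the pull requests above. The function name `rewrite_urls` and its parameters are hypothetical, and the regex-based URL detection is a deliberate simplification: it would miss URLs in base64-encoded fragments and may include trailing punctuation in plain text.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Matches absolute http(s) URLs up to the next whitespace, quote, or angle bracket.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def rewrite_urls(content: str, source_base: str, target_base: str) -> str:
    """Point absolute URLs under source_base at target_base (hypothetical helper)."""
    source = urlsplit(source_base)
    target = urlsplit(target_base)
    source_path = source.path.rstrip("/")
    target_path = target.path.rstrip("/")

    def replace(match: re.Match) -> str:
        url = urlsplit(match.group(0))
        # Leave URLs on other hosts untouched.
        if url.netloc.lower() != source.netloc.lower():
            return match.group(0)
        # Only rewrite paths that live under the source site's base path.
        if url.path != source_path and not url.path.startswith(source_path + "/"):
            return match.group(0)
        new_path = target_path + url.path[len(source_path):]
        # Keep the query string and fragment; swap scheme, host, and base path.
        return urlunsplit((target.scheme, target.netloc, new_path, url.query, url.fragment))

    return URL_PATTERN.sub(replace, content)
```

Calling `rewrite_urls(post_content, "https://old.example/blog", "https://new.example")` would rewrite links into the old site and leave links to third-party hosts alone.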
4. Add a streaming data flow in the wordpress-importer plugin
Existing filters, such as wp_import_categories, assume the entire import context is stored in memory. This isn't a viable approach for processing larger datasets. To support them, we need to break backward compatibility (BC) on the existing filters without breaking the plugins that extend those filters.
One way of doing that would be forking the importer plugin and removing those filters. But forking is challenging – the usage drops, changes need to be backported, and the codebase becomes fragmented. There's an easier way.
Let's create a second, experimental streaming data flow in the importer plugin. By default, WXR files would still be processed by the existing machinery. When a flag is set or a checkbox is checked, we'd switch to a streaming processor that imports one chunk at a time and can recover from timeouts and OOM errors.
- Split the data flows to stream-process files. Do not use any existing filters and hooks in the new data flow.
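As a rough illustration of what "one chunk at a time" means, here is a sketch in Python (not the plugin's PHP, and not the WXR_Parser_XML_Processor API). It uses the standard library's incremental XML parser and assumes only that a WXR file is an RSS document whose entries are `<item>` elements. The name `stream_items` is hypothetical.

```python
import xml.etree.ElementTree as ET


def stream_items(wxr_file):
    """Yield one <item> element at a time instead of loading the whole document."""
    # iterparse reads the file incrementally and reports each element as it closes.
    for _, element in ET.iterparse(wxr_file, events=("end",)):
        if element.tag == "item":
            yield element
            # The consumer is done with this item: free its children.
            # An empty <item> shell stays attached to its parent.
            element.clear()
```

The consumer has to extract what it needs from each item before asking for the next one, which is exactly the constraint that the existing whole-document filters cannot satisfy.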
5. Naive large file support
wordpress-importer is unable to process large files due to two constraints: PHP request timeout and the memory limit. Let's break out of those by supporting a re-entrant, multi-request importing flow. First, we wouldn't store everything in memory. Second, we'd know how to pause the import process and resume it later.
- Store the user/post/etc. mapping data in the database, not in memory
- Store the current import cursor
- Support resuming the import from a cursor
Cases explicitly not covered at this stage:
- Unsorted WXR files where parent posts come after their children
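The three steps listed above (persisted mapping data, a stored cursor, and resuming from it) could be sketched as follows. This is a Python illustration using SQLite as a stand-in for the WordPress database. The table names, `open_state`, `import_batch`, and the `insert_entity` callback are all hypothetical. The cursor here is an index into a list; a real importer would more likely store a position in the WXR file.

```python
import sqlite3


def open_state(path):
    """Open (or create) the persistent import state."""
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS id_map (old_id INTEGER PRIMARY KEY, new_id INTEGER)")
    db.execute(
        "CREATE TABLE IF NOT EXISTS import_state "
        "(id INTEGER PRIMARY KEY CHECK (id = 1), position INTEGER)"
    )
    db.execute("INSERT OR IGNORE INTO import_state (id, position) VALUES (1, 0)")
    db.commit()
    return db


def import_batch(db, entities, insert_entity, max_entities):
    """Import up to max_entities starting at the stored cursor.

    Returns True once every entity has been imported.
    """
    position = db.execute("SELECT position FROM import_state WHERE id = 1").fetchone()[0]
    for old_id, payload in entities[position:position + max_entities]:
        new_id = insert_entity(payload)
        with db:  # one transaction: the mapping and the cursor advance together
            db.execute("INSERT OR REPLACE INTO id_map VALUES (?, ?)", (old_id, new_id))
            position += 1
            db.execute("UPDATE import_state SET position = ? WHERE id = 1", (position,))
    return position >= len(entities)
```

Each HTTP request would call `import_batch` once with a budget it can finish before the timeout. Because the cursor lives in the database, the next request picks up where the previous one stopped. One gap in this sketch: a crash between `insert_entity` and the commit would import that entity twice on resume.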
6. Fast, concurrent assets download
The wordpress-importer plugin downloads all the remote assets one by one. It's slow! Let's parallelize those downloads and fetch, say, up to 10 files concurrently at any given time.
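A bounded worker pool is one simple way to cap the number of downloads in flight. The sketch below is Python rather than the plugin's PHP, and the names `fetch` and `download_all` are hypothetical. The limit of 10 comes from the paragraph above; the 30-second timeout is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import urlopen


def fetch(url, timeout=30):
    """Download a single asset and return its bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def download_all(urls, fetch_one=fetch, max_concurrent=10):
    """Fetch every URL with at most max_concurrent requests in flight."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        futures = {pool.submit(fetch_one, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as error:
                # One failed asset should not abort the rest; record it for a retry.
                errors[url] = error
    return results, errors
```

Returning the failures separately instead of raising leaves room for the error recovery ideas listed under "Future", such as offering to retry the files that couldn't be fetched.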
7. Future
The points above will already take some time to implement. Here are some items that would be good to look into afterwards:
- Disable filters for the duration of the import to prevent, e.g., sending emails
- More data formats, e.g. Markdown, HTML
- More data sources, e.g. WXR URL, Git repo, another WordPress site, an arbitrary URL
- UI improvements, e.g. dropzone, progress bar, detailed import log and statistics
- Error recovery, e.g. "5 media files couldn't be fetched, do you want to retry? ignore them? upload alternative files?"
- Real-time importing from another source, e.g. publishing a post on site A auto-publishes it on site B
Prior art
Here's the landscape of WordPress WXR importers out in the wild.
WXR importers:
- Humanmade WordPress-importer – a fork of the wordpress-importer plugin
- pbiron/wordpress-importer-v2 - last updated 8 years ago
- WP-CLI import command – it wraps the WP_Import class by overriding the $_POST etc. superglobals and not showing any rendered HTML in the CLI output.
Other WordPress migration products (selected few):
- https://wordpress.org/plugins/duplicator/
- https://wordpress.org/plugins/updraftplus/
- https://wordpress.org/plugins/all-in-one-wp-migration/
- https://wordpress.org/plugins/wp-staging/
- https://wordpress.org/plugins/backup-backup/
- https://wordpress.org/plugins/migrate-guru/
- https://wordpress.org/plugins/wp-migration-duplicator/
- https://wordpress.org/plugins/bv-pantheon-migration/
Other, related discussions: