
Improve the official WordPress WXR importer #138

@adamziel

Let's use the data processing libraries maintained in this repository to improve the official WordPress WXR importer.

Goals

  • Short-term: Improve the WXR importing experience in a succession of small, frequent, meaningful changes to the existing wordpress-importer plugin.
  • Long-term: Ship a selective, fast, live WordPress <-> WordPress sync.
  • Anti-goal: Get stuck for two months on rewriting the entire existing importer plugin from scratch before shipping the first user-facing change.

The current state of the WordPress importer

I've mapped out some features and shortcomings of the existing WordPress importer. The proposed roadmap below is based on this research:

[Image: map of the features and shortcomings of the existing WordPress importer]

Draft roadmap – let's discuss

1. Testability

The wordpress-importer implementation is a mixture of wp-admin UI rendering, input processing, and importing logic.
The https://github.com/WordPress/wordpress-importer repository has some unit tests. Let's investigate how we can leverage and expand that test harness to make changes without breaking existing use cases.

2. Compatibility with all hosts

The wordpress-importer plugin relies on libxml and won't work on some hosts. That's quite restrictive – let's make it compatible with all the hosts!

3. URL rewriting

wordpress-importer does not rewrite absolute URLs in the imported posts and comments. As a result, the imported data often contains broken links to a source site.
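To make the idea concrete, here is a minimal sketch of URL rewriting, written in Python purely for illustration (the plugin itself is PHP). The function name `rewrite_urls` and the example domains are hypothetical, and a real implementation would likely need to handle block markup, encoded URLs, and protocol-relative links that this regex-based pass ignores.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Matches absolute http(s) URLs up to the first whitespace, quote, or angle bracket.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def rewrite_urls(content: str, source_base: str, target_base: str) -> str:
    """Point absolute URLs under source_base at target_base; leave others alone."""
    source = urlsplit(source_base)
    target = urlsplit(target_base)
    source_path = source.path.rstrip("/")
    target_path = target.path.rstrip("/")

    def replace(match: re.Match) -> str:
        url = urlsplit(match.group(0))
        # Compare whole hosts so "old.example.com.evil.org" is never rewritten.
        # The source scheme is ignored, so http and https links both match.
        if url.netloc.lower() != source.netloc.lower():
            return match.group(0)
        # Only rewrite URLs that live under the source site's base path.
        if not (url.path == source_path or url.path.startswith(source_path + "/")):
            return match.group(0)
        new_path = target_path + url.path[len(source_path):]
        return urlunsplit((target.scheme, target.netloc, new_path, url.query, url.fragment))

    return URL_PATTERN.sub(replace, content)
```

Parsing each candidate URL instead of doing a plain string replace keeps look-alike hosts and third-party assets untouched.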

4. Add a streaming data flow in the wordpress-importer plugin

Existing filters, such as wp_import_categories, assume the entire import context is stored in memory. This isn't a viable approach for processing larger datasets. To support them, we need to break backward compatibility (BC) on the existing filters without breaking the plugins that hook into those filters.

One way of doing that would be forking the importer plugin and removing those filters. But forking is challenging – the usage drops, changes need to be backported, and the codebase becomes fragmented. There's an easier way.

Let's create a second, experimental streaming data flow in the importer plugin. By default, WXR files would still be processed by the existing machinery. When a flag is set or a checkbox is checked, we'd switch to a streaming processor that imports one chunk at a time and can recover from timeouts and OOM errors.

  • Split the data flows to stream-process files. Do not use any existing filters and hooks in the new data flow.
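As a rough illustration of what "one chunk at a time" could look like, the sketch below streams `<item>` elements out of a WXR-like document instead of loading the whole file. It is written in Python with the standard library's `iterparse`, purely to show the shape of the data flow; the function name `stream_items` is hypothetical, and the actual plugin would need a PHP streaming parser.

```python
import io
import xml.etree.ElementTree as ET


def stream_items(source):
    """Yield one <item> at a time instead of building the whole tree first."""
    for _event, element in ET.iterparse(source, events=("end",)):
        if element.tag == "item":
            # The "end" event fires once the item and all its children are parsed.
            yield {"title": element.findtext("title", default="")}
            # Drop the item's children once handled; the emptied element
            # itself still stays attached to its parent.
            element.clear()


if __name__ == "__main__":
    wxr = b"<rss><channel><item><title>Hello</title></item></channel></rss>"
    for item in stream_items(io.BytesIO(wxr)):
        print(item["title"])
```

Because the consumer only ever sees one item, nothing in this flow can rely on a filter that expects the full list of categories or posts up front.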

5. Naive large file support

wordpress-importer is unable to process large files due to two constraints: PHP request timeout and the memory limit. Let's break out of those by supporting a re-entrant, multi-request importing flow. First, we wouldn't store everything in memory. Second, we'd know how to pause the import process and resume it later.

  • Store the user/post/etc. mapping data in the database, not in memory
  • Store the current import cursor
  • Support resuming the import from a cursor
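The three points above can be sketched as follows. This is a minimal Python illustration using SQLite, not the plugin's code: the table names `id_map` and `import_cursor`, the `import_batch` function, and the fake `1000 + source_id` target IDs are all assumptions made for the example, and the in-memory `entities` list stands in for a stream where the cursor would more likely be a byte offset.

```python
import sqlite3


def open_state(path=":memory:"):
    """Create the tables that hold the ID mapping and the import cursor."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS id_map ("
        "kind TEXT, source_id INTEGER, target_id INTEGER, "
        "PRIMARY KEY (kind, source_id))"
    )
    db.execute(
        "CREATE TABLE IF NOT EXISTS import_cursor ("
        "id INTEGER PRIMARY KEY CHECK (id = 1), position INTEGER NOT NULL)"
    )
    db.execute("INSERT OR IGNORE INTO import_cursor (id, position) VALUES (1, 0)")
    db.commit()
    return db


def read_position(db):
    return db.execute("SELECT position FROM import_cursor WHERE id = 1").fetchone()[0]


def import_batch(db, entities, budget):
    """Import up to `budget` entities from the stored cursor; return True when done."""
    position = read_position(db)
    for kind, source_id in entities[position:position + budget]:
        # Hypothetical stand-in: a real importer would create the post or user
        # here and receive its new ID from WordPress.
        target_id = 1000 + source_id
        # Persist the mapping and advance the cursor in one transaction, so a
        # timeout between entities never leaves them out of sync.
        with db:
            db.execute(
                "INSERT OR REPLACE INTO id_map VALUES (?, ?, ?)",
                (kind, source_id, target_id),
            )
            db.execute("UPDATE import_cursor SET position = position + 1 WHERE id = 1")
    return read_position(db) >= len(entities)
```

Each request would call `import_batch` with whatever budget fits inside the time and memory limits, and the next request picks up from the stored cursor.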

Cases explicitly not covered at this stage:

  • Unsorted WXR files where parent posts come after their children

6. Fast, concurrent assets download

The wordpress-importer plugin downloads all the remote assets one by one. It's slow! Let's parallelize those downloads and fetch, say, up to 10 files concurrently at any given time.
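A bounded pool is one simple way to express "up to 10 at a time". The sketch below uses Python's thread pool only to illustrate the concurrency cap; `download_all`, `fetch`, and the `fetch_one` parameter are hypothetical names, and a PHP implementation would need a different mechanism for concurrent requests.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url, timeout=30):
    """Download a single asset and return its bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def download_all(urls, fetch_one=fetch, max_concurrent=10):
    """Fetch every URL with at most `max_concurrent` downloads in flight."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        futures = {url: pool.submit(fetch_one, url) for url in urls}
        for url, future in futures.items():
            try:
                results[url] = future.result()
            except Exception as error:
                # Keep the failure next to its URL so the caller can retry
                # or report it without aborting the other downloads.
                results[url] = error
    return results
```

Recording failures per URL rather than raising keeps one broken asset from stopping the rest of the import.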

7. Future

The points above will already take some time to implement. Here are some items that would be good to look into afterwards:

  • Disable filters for the duration of the import to prevent, e.g., sending emails
  • More data formats, e.g. Markdown, HTML
  • More data sources, e.g. WXR URL, Git repo, another WordPress site, an arbitrary URL
  • UI improvements, e.g. dropzone, progress bar, detailed import log and statistics
  • Error recovery, e.g. "5 media files couldn't be fetched, do you want to retry? ignore them? upload alternative files?"
  • Real-time importing from another source, e.g. publishing a post on site A auto-publishes it on site B

Prior art

Here's the landscape of WordPress WXR importers out in the wild.

WXR importers:

Other WordPress migration products (selected few):

Other, related discussions:
