- WordPress Import Package (wp-import)
A Go package for parsing, analyzing, and converting WordPress export XML files. This library provides robust functionality for accessing WordPress content programmatically and transforming it into various formats.
- Complete WordPress XML Parsing: Parse WordPress export files into structured Go types
- Content Format Conversion:
- Convert WordPress HTML content to clean plain text
- Convert WordPress HTML content to properly formatted Markdown
- Sanitize WordPress content (remove Gutenberg blocks, clean HTML)
- WordPress Data Analysis:
- Extract metadata from posts and pages
- Analyze custom post types, plugins, and themes
- Gather information about media attachments
- Advanced Content Processing:
- Handle lists (ordered and unordered) with proper formatting:
- Ordered lists convert to numbered points (1., 2., etc.)
- Unordered lists convert to bullet points (•)
- Proper spacing before and after lists
- Convert HTML tables to Markdown tables with header separators
- Proper formatting of headings, code blocks, blockquotes, and images
- Support for inline formatting (bold, italic, strikethrough)
go get github.com/boomhut/wp-import
- Go 1.24 or higher
- No external dependencies outside the Go standard library
This package is compatible with:
- Standard WordPress export files (WXR format)
- WordPress exports from version 4.0 and newer
- Both single-site and multisite exports
package main
import (
"fmt"
"log"
"github.com/boomhut/wp-import"
)
func main() {
// Parse WordPress export file
site, err := wpimport.ParseWordPressXML("wordpress-export.xml")
if err != nil {
log.Fatalf("Error parsing WordPress XML: %v", err)
}
// Display basic site info
fmt.Printf("Site Title: %s\n", site.Channel.Title)
fmt.Printf("Site URL: %s\n", site.Channel.Link)
// Get posts by type
posts := site.GetPostsByType("post")
fmt.Printf("Found %d posts\n", len(posts))
// Convert content to Markdown
if len(posts) > 0 {
markdown := wpimport.ConvertToMarkdown(posts[0].Content)
fmt.Printf("Markdown preview: %s\n", truncateString(markdown, 200))
}
}
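func truncateString(s string, maxLen int) string {
// Truncate by runes so a multi-byte UTF-8 character is never split
runes := []rune(s)
if len(runes) <= maxLen {
return s
}
return string(runes[:maxLen]) + "..."
}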
- ParseWordPressXML(filename string) (*WordPressSite, error): Parse a WordPress export XML file
- ParseWordPressDate(dateStr string) (time.Time, error): Parse a WordPress date string
- SanitizeWordPressContent(content string) string: Clean up WordPress content
- CleanHTML(content string) string: Sanitize HTML while preserving its structure
- ConvertToPlainText(content string) string: Convert HTML content to plain text
- ConvertToMarkdown(content string) string: Convert HTML content to Markdown
- convertOrderedListsToPlainText(content string) string: Convert <ol> lists to numbered plain text
- convertUnorderedListsToPlainText(content string) string: Convert <ul> lists to bullet points
- convertOrderedLists(content string) string: Convert <ol> lists to Markdown format
- convertUnorderedLists(content string) string: Convert <ul> lists to Markdown format
- convertTables(content string) string: Convert HTML tables to Markdown table format
- GetPostsByType(postType string) []Item: Get posts of a specific type
- GetPublishedPosts() []Item: Get only published posts
- GetPostByID(id int) *Item: Find a post by ID
- GetAuthors() []Author: Get all authors
- GetCustomTerms() []Term: Get custom taxonomy terms
- GetAttachmentURLs() []string: Get all attachment URLs
- AnalyzePluginData() map[string]interface{}: Analyze plugin usage
- GetCustomStyles() map[string]string: Extract custom CSS and styles
- GetPageBuilderData() map[string]interface{}: Analyze page builder usage
- GetThemeInfo() map[string]string: Extract theme information
The package provides comprehensive types that map to WordPress export structures:
- WordPressSite: Root structure for the WordPress export
- Channel: Contains site information and all content items
- Author: WordPress user account information
- Item: Post, page, or other content type
- Category: WordPress category
- Tag: WordPress tag
- Term: Custom taxonomy term
- PostMeta: Custom fields and metadata
- Comment: Post comment
- CommentMeta: Comment metadata
htmlContent := `<p>This is a <strong>paragraph</strong> with <em>formatting</em>.</p>
<ul>
<li>Bullet point 1</li>
<li>Bullet point 2</li>
</ul>
<ol>
<li>First ordered item</li>
<li>Second ordered item</li>
</ol>`
plainText := wpimport.ConvertToPlainText(htmlContent)
fmt.Println(plainText)
Output:
This is a paragraph with formatting.
• Bullet point 1
• Bullet point 2
1. First ordered item
2. Second ordered item
htmlContent := `<h1>Heading</h1>
<p>This is a <strong>paragraph</strong> with <em>formatting</em>.</p>
<ul>
<li>Bullet point</li>
<li>Another bullet point</li>
</ul>
<ol>
<li>First item</li>
<li>Second item</li>
</ol>
<blockquote>This is a blockquote</blockquote>
<pre><code>function example() {
return "This is a code block";
}</code></pre>
<table>
<tr>
<th>Header 1</th>
<th>Header 2</th>
</tr>
<tr>
<td>Cell 1</td>
<td>Cell 2</td>
</tr>
</table>`
markdown := wpimport.ConvertToMarkdown(htmlContent)
fmt.Println(markdown)
Output (after conversion to Markdown):
# Heading
This is a **paragraph** with *formatting*.
- Bullet point
- Another bullet point
1. First item
2. Second item
> This is a blockquote
<!-- Code block converted from HTML -->
function example() { return "This is a code block"; }
| Header 1 | Header 2 |
| --- | --- |
| Cell 1 | Cell 2 |
htmlContent := `<p>This is <b>bold</b> text with a <font color="red">colored font</font> tag.</p>
<center>This text is centered</center>
<p class="wp-block-paragraph aligncenter" style="">Paragraph with WordPress classes</p>
<div data-wp-block="true" data-align="wide">Block with data attributes</div>
<ul><li>Bullet</li><li>Another <strong>bullet</strong> with <em>formatting</em></li></ul>`
cleanHTML := wpimport.CleanHTML(htmlContent)
fmt.Println(cleanHTML)
Output (after cleaning):
<p>This is <strong>bold</strong> text with a <span>colored font</span> tag.</p>
<div style="text-align: center;">This text is centered</div>
<p>Paragraph with WordPress classes</p>
<div>Block with data attributes</div>
<ul>
<li>Bullet</li>
<li>Another <strong>bullet</strong> with <em>formatting</em></li>
</ul>
Memory usage is optimized for processing large WordPress exports:
- Parsing a 100MB WordPress export uses approximately 200-300MB RAM
- Converting 1000 posts to Markdown (average 10KB each) uses approximately 50-100MB RAM
For optimal performance:
- Process posts in batches or use goroutines for parallel processing
- For extremely large exports, consider splitting the XML file
- Use the conversion functions directly on individual posts rather than processing the entire content at once
The package includes specialized helper functions for properly converting different types of HTML lists:
// Helper function to convert ordered lists to numbered plain text
func convertOrderedListsToPlainText(content string) string {
// Extracts each list item and numbers them (1., 2., etc.)
// Preserves list item contents while stripping HTML
}
// Helper function to convert unordered lists to bullet point plain text
func convertUnorderedListsToPlainText(content string) string {
// Extracts each list item and adds bullet points (•)
// Preserves list item contents while stripping HTML
}
// Helper function to convert ordered lists to markdown
func convertOrderedLists(content string) string {
// Extracts and converts ordered lists to markdown format
// Adds proper spacing before and after lists
// Ensures correct numbering (1., 2., etc.)
}
// Helper function to convert unordered lists to markdown
func convertUnorderedLists(content string) string {
// Extracts and converts unordered lists to markdown format
// Uses proper markdown bullet point style (-)
// Adds proper spacing before and after lists
}
// Helper function to convert HTML tables to markdown format
func convertTables(content string) string {
// Processes HTML tables and converts them to markdown tables
// Creates header row with separator
// Handles cell content and escapes pipe characters
// Maintains proper alignment and spacing
}
These functions work by using regular expressions to locate structured elements in the HTML content, extract their components, and format them according to the target format. They handle proper spacing and ensure that nested content is correctly processed.
The CleanHTML function provides a way to sanitize WordPress HTML content while preserving its structure:
- Remove WordPress-specific Elements:
- Removes Gutenberg block comments
- Cleans out empty paragraphs and unnecessary whitespace
- Fix HTML Structure Issues:
- Repairs unclosed or improperly nested tags
- Fixes malformed list structures
- Ensures proper HTML structure is maintained
- Modernize HTML:
- Updates deprecated tags to modern HTML5 equivalents
- Converts <center> to styled divs
- Converts <font> to spans
- Converts <b> to <strong> and <i> to <em>
- Clean Up Attributes:
- Removes empty and WordPress-specific attributes
- Cleans up unnecessary styling attributes
- Removes data attributes that are WordPress-specific
The ConvertToPlainText function works through these steps:
- Sanitize: First, the WordPress content is cleaned up to remove non-standard HTML
- List Processing:
- Ordered lists (<ol>) are converted to numbered text (1., 2., etc.)
- Unordered lists (<ul>) are converted to bullet points (•)
- Tag Processing:
- <br> tags are replaced with newlines
- </p> tags are replaced with double newlines
- All other HTML tags are removed
- Entity Decoding: All HTML entities are decoded (e.g., &amp; to &)
- Whitespace Cleanup: Excessive whitespace is normalized
The ConvertToMarkdown function follows a more comprehensive process:
- Sanitize Content: Remove WordPress-specific HTML and clean up
- Process Block Elements:
- Lists are converted first to avoid interference with other processing
- Tables are processed into properly formatted Markdown tables
- Headers, blockquotes, and code blocks are converted
- Process Inline Elements:
- Process links, images, and inline formatting
- Handle text formatting (bold, italic, strikethrough)
- Final Cleanup:
- Remove any remaining HTML tags
- Decode HTML entities
- Normalize whitespace
- Ensure proper spacing between elements
The conversion relies on carefully crafted regular expressions:
- Multiline matching: Uses the (?s) flag so that . matches across newlines
- Non-greedy matching: Uses .*? to avoid over-capturing
- Attribute-aware matching: Handles variations in HTML tag attributes
You can combine the package's functions for specialized conversion needs:
// First sanitize, then perform custom transformations, then convert
content := wpimport.SanitizeWordPressContent(htmlContent)
// Apply your custom transformations to content
markdown := wpimport.ConvertToMarkdown(content)
For performance when processing many posts:
site, err := wpimport.ParseWordPressXML("wordpress-export.xml")
if err != nil {
log.Fatal(err)
}
// Process posts in parallel with goroutines
var wg sync.WaitGroup
posts := site.GetPostsByType("post")
for _, post := range posts {
wg.Add(1)
go func(p wpimport.Item) {
defer wg.Done()
// Process the post content
markdown := wpimport.ConvertToMarkdown(p.Content)
_ = markdown // Do something with the markdown (Go rejects unused variables)
}(post)
}
wg.Wait()
If you encounter errors parsing very large WordPress exports:
// Detect parser errors that indicate an oversized XML token
site, err := wpimport.ParseWordPressXML("large-wordpress-export.xml")
if err != nil {
if strings.Contains(err.Error(), "token too large") {
// Large file handling - split the XML or process in chunks
log.Println("XML file too large, consider splitting")
}
log.Fatal(err)
}
For very large WordPress sites:
// Process posts in batches to manage memory
posts := site.GetPostsByType("post")
batchSize := 50
totalPosts := len(posts)
for i := 0; i < totalPosts; i += batchSize {
end := i + batchSize
if end > totalPosts {
end = totalPosts
}
batch := posts[i:end]
// Process this batch of posts
processPostBatch(batch)
}
If your WordPress content has complex shortcodes:
// Apply more aggressive shortcode removal before conversion
content := wpimport.SanitizeWordPressContent(post.Content)
// Additional shortcode handling if needed
content = regexp.MustCompile(`\[.*?\]`).ReplaceAllString(content, "")
markdown := wpimport.ConvertToMarkdown(content)
Contributions are welcome! Please feel free to submit a Pull Request.
If you encounter any bugs or have feature requests, please open an issue on the GitHub repository.
MIT License. See the LICENSE file for details.