Interpreting a WordPress Export XML file

Tags: import schema XML

While preparing to write a tool that can import most of an existing WordPress site into WebGUI (see the Kickstarter to release WebGUI Version 8), I found some good news: a WordPress export file can be loaded into Perl quite easily:

#!/usr/bin/perl
use XML::Simple qw(:strict);
my $xs = XML::Simple->new();
my $ref = $xs->XMLin($ARGV[0], ForceArray => 1, KeyAttr => []);

Pass that script the name of your export file, and you can peruse the results with

use Data::Dumper;
print Dumper($ref);

Here's what you will find:

The entirety of the file is in XML format. Happily, this means all the quoting and Unicode in it seem to be loaded correctly by XML::Simple.
The top tag describes the file as rss version 2.0.
Inside that, the entirety of a website is defined in a channel tag. All the remaining tags I describe below are nested exactly one level inside the channel.
The first tags give the WordPress site's title, existing URL base link, and description.
The top pubDate is the date that the site was exported.
There are language, base_site_url, and base_blog_url tags; I believe base_site is the "WordPress Address" and base_blog is the "Site Address" as described on the General Settings page in the WP Admin.
Next comes a series of wp:author tags, each containing the author_id (WordPress reveals much of its schema relationships through its integer id fields), login name (which may differ from the display_name shown in a post/page), email address, and first and last names.
Following the authors is a list of wp:category tags which describe the categories in which WordPress posts are placed. Strangely, pages − although stored exactly the same as posts − cannot normally be given categories in WP. Perhaps the only useful data here is the cat_name field that is used as the display name for the categories. Within each post stored in the XML, a post-to-category relation is given by matching the category_nicename here to the nicename attribute of a category element in the post. Note, this exposes how WP's original category system was extended to be arbitrary "taxonomies" − thus the domain="category" attribute.
Then we encounter a group of wp:tag … entries, each of which has an integer id (WP's schema relationships showing), a slug (WP's term for what you put in http://example.com/SLUG_HERE to get the index page for that bunch of tags, or whatever; slugs are URL-safe and do not include spaces or punctuation other than dashes) and a tag_name (which can have spaces or punctuation). As with categories, you will want to match a post's category elements (those with domain="post_tag") against the slug value from here. And as with categories, for some reason WP by default won't assign tags to pages, only posts.
Following the tags are the item entries. Each of these contains a WordPress post, page, attachment, or custom post type; details below.
That's all that's in the file.
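
The channel-level layout described above can be sketched as a few lookup tables. This is only a sketch under stated assumptions: the child element names wp:author_login, wp:author_display_name and wp:tag_slug are what I would expect from a typical export rather than anything guaranteed, the helper names text_of and summarize_channel are made up for illustration, and the inline sample is a hypothetical stand-in shaped like $ref->{channel}[0] as loaded with ForceArray => 1 and KeyAttr => [], so that the snippet runs without an export file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# XML::Simple hands back an empty element as an empty hash ref and an
# element with attributes as a hash with a 'content' key, so reduce
# every leaf to a plain string before using it.
sub text_of {
    my ($node) = @_;
    my $value = ref($node) eq 'ARRAY' ? $node->[0] : $node;
    return '' unless defined $value;
    return $value unless ref $value;
    return $value->{content} if ref($value) eq 'HASH' && defined $value->{content};
    return '';
}

# Collect site info plus tables keyed by the values that items refer to:
# author login, category nicename, and tag slug.
sub summarize_channel {
    my ($channel) = @_;
    my %summary = (
        title => text_of($channel->{title}),
        link  => text_of($channel->{link}),
    );
    for my $author (@{ $channel->{'wp:author'} || [] }) {
        $summary{authors}{ text_of($author->{'wp:author_login'}) }
            = text_of($author->{'wp:author_display_name'});
    }
    for my $cat (@{ $channel->{'wp:category'} || [] }) {
        $summary{categories}{ text_of($cat->{'wp:category_nicename'}) }
            = text_of($cat->{'wp:cat_name'});
    }
    for my $tag (@{ $channel->{'wp:tag'} || [] }) {
        $summary{tags}{ text_of($tag->{'wp:tag_slug'}) }
            = text_of($tag->{'wp:tag_name'});
    }
    return \%summary;
}

# Hypothetical sample data, not taken from a real export.
my $channel = {
    title       => ['Example Site'],
    link        => ['http://example.com'],
    description => [ {} ],    # an empty element
    'wp:author' => [
        { 'wp:author_login' => ['alice'], 'wp:author_display_name' => ['Alice A.'] },
    ],
    'wp:category' => [
        { 'wp:category_nicename' => ['news'], 'wp:cat_name' => ['Site News'] },
    ],
    'wp:tag' => [
        { 'wp:tag_slug' => ['perl-tips'], 'wp:tag_name' => ['Perl tips'] },
    ],
};

my $summary = summarize_channel($channel);
printf "%s <%s>\n", $summary->{title}, $summary->{link};
for my $kind (qw(authors categories tags)) {
    for my $key (sort keys %{ $summary->{$kind} || {} }) {
        print "$kind: $key => $summary->{$kind}{$key}\n";
    }
}
```

With a real export you would pass $ref->{channel}[0] instead of the sample hash. Keying the tables by login, nicename and slug is a choice made here because those are the values the items use for cross-referencing.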

WordPress item contents

title
link − A search-engine unfriendly canonical link in the style of ?p=post_id …probably can be safely ignored
pubDate − seems unreliable
post_date − actual original date that post was first Published, in the WordPress vernacular. This is in whatever the server's local timezone was at a slightly random deviation from the actual global time.
post_date_gmt − usually zero, unreliable, or flat-out lies. WordPress, failing Codd's third normal form since 2006!
dc:creator − a lame attempt at appearing LDAP compatible, this seems to contain the author's login name
guid − allegedly a globally unique identifier, this is complete fiction, and utterly unreliable for any purpose whatsoever. Ignore.
description − often blank, but sometimes set through WP's admin
content:encoded − the actual content of the post/page, subject to WordPress's wpautop function, which turns \n\n as stored here back into somewhat properly nested <p>…</p> tags, with a slew of exceptions and special handling. That mishmash of logic is executed each time WP displays anything − WordPress, mangling your text each runtime since 2006!
post_id − WP's internal integer relation id for this post/page
comment_status − "open" or "closed". If open, consult Wikipedia under "xss" for an idea of the mayhem which might ensue.
ping_status − "open" or "closed". If open, an invitation to xmlrpc ddos the site for no discernible reason.
post_name − often blank if item is a Post, or the item's "slug" if a Page (or sometimes if a custom post type). Note that WP enforces that each item have a unique post_name, so you can't have pages with URLs /hotels/ma/springfield and /hotels/oh/springfield − you'll need to use something nasty like /hotels/ma/springfield-ma and /hotels/oh/springfield-oh. WordPress, mangling your URLs since 2007!
status − draft, publish, private… other values may be legal. There have been some attempts to create real workflow systems in WP, giving users roles (bundles of "capabilities") such as Subscriber, Contributor, Editor, Administrator. Implementing that correctly would imply putting other values here to indicate an item's position in that workflow. This is ragingly incomplete and subject to custom plugins mucking about.
post_parent − Theoretically only for Pages not Posts, this gives the parent's post_id in the hierarchy of pages.
menu_order − WP's half-hearted attempt to let you dictate the order of pages in the sidebar. Its usefulness has been massacred by the "Menu" system introduced in WP 3.0, and in other ways that make administering a real hierarchical tree of pages nearly impossible.
post_type − post or page, or some other custom post type or taxonomy. Attempting to import a site with anything other than post or page here may be an exercise in frustration or futility.
is_sticky − 0 or 1, just when you thought it might be "true" or "false" or "yes" or "no", or maybe "vrai" or "faux", 真 or 假… PHP is fun like that. Nonzero if the post is supposed to somehow be sticky, although exactly what that means is left to the Theme Author as an exercise.
category − zero or more entries with domain=category, post_tag, or a custom taxonomy. Cross-referenced by the nicename value, which must be unique across all taxonomies (the same muddled thinking about uniqueness as with page slugs and URLs).
wp:postmeta − an array of wp:meta_key / wp:meta_value pairs which the user can edit through the WP admin. Sometimes used to fun effect by themes or plugins, these can actually do or be anything at all.
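
To make the field list above concrete, here is a rough sketch that boils one item down to a flat hash, groups its category entries by domain, and flattens wp:postmeta. It assumes the structure XML::Simple produces with ForceArray => 1 and KeyAttr => []. The helper names text_of and item_summary and the sample item are hypothetical, invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Reduce an XML::Simple leaf (string, empty hash, or hash with 'content')
# to a plain string.
sub text_of {
    my ($node) = @_;
    my $value = ref($node) eq 'ARRAY' ? $node->[0] : $node;
    return '' unless defined $value;
    return $value unless ref $value;
    return $value->{content} if ref($value) eq 'HASH' && defined $value->{content};
    return '';
}

sub item_summary {
    my ($item) = @_;
    my %info;
    for my $field (qw(post_id post_type post_name status post_parent post_date)) {
        $info{$field} = text_of($item->{"wp:$field"});
    }
    $info{title}  = text_of($item->{title});
    $info{author} = text_of($item->{'dc:creator'});

    # Group the item's category entries by their domain attribute
    # (category, post_tag, or a custom taxonomy), keeping the nicename.
    for my $term (@{ $item->{category} || [] }) {
        next unless ref($term) eq 'HASH';
        next unless defined $term->{domain} && defined $term->{nicename};
        push @{ $info{terms}{ $term->{domain} } }, $term->{nicename};
    }

    # Flatten meta_key / meta_value pairs. If a key repeats, the last
    # value wins in this sketch.
    for my $meta (@{ $item->{'wp:postmeta'} || [] }) {
        $info{meta}{ text_of($meta->{'wp:meta_key'}) }
            = text_of($meta->{'wp:meta_value'});
    }
    return \%info;
}

# Hypothetical sample item, not taken from a real export.
my $item = {
    title            => ['Hello world'],
    'dc:creator'     => ['alice'],
    'wp:post_id'     => ['42'],
    'wp:post_type'   => ['post'],
    'wp:post_name'   => [ {} ],    # blank, as is common for posts
    'wp:status'      => ['publish'],
    'wp:post_parent' => ['0'],
    'wp:post_date'   => ['2012-05-01 09:30:00'],
    category         => [
        { domain => 'category', nicename => 'news',      content => 'Site News' },
        { domain => 'post_tag', nicename => 'perl-tips', content => 'Perl tips' },
    ],
    'wp:postmeta' => [
        { 'wp:meta_key' => ['_edit_last'], 'wp:meta_value' => ['1'] },
    ],
};

my $info = item_summary($item);
print "$info->{post_type} $info->{post_id}: $info->{title} by $info->{author}\n";
for my $domain (sort keys %{ $info->{terms} || {} }) {
    print "  $domain: @{ $info->{terms}{$domain} }\n";
}
```

The nicename values collected here are the ones to look up against the category_nicename and tag slug values at the channel level.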

That's more or less the organization of this file. As you can see, it delightfully shows that some of WordPress's foundations still suffer from their original cracks.

Example code:

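#!/usr/bin/perl

use strict;
use warnings;

use XML::Simple qw(:strict);
use Data::Dumper;
use HTML::TreeBuilder;

binmode(STDOUT, ":utf8");   # enable Unicode output

# Load the export file named on the command line. ForceArray => 1 makes
# every nested element an array ref, hence the [0] subscripts below.
my $xs = XML::Simple->new();
my $ref = $xs->XMLin($ARGV[0], ForceArray => 1, KeyAttr => []);
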
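# Turn WordPress's stored text (blank lines as paragraph breaks) into
# properly nested HTML and return it as a string.
sub rectify_html {
    my $munged_text = shift;

    # Set <pre> contents aside first, so that neither the paragraph
    # substitution nor the HTML parser alters them.
    my @pre_segments;
    my $seg_id = 0;
    $munged_text =~ s{<pre\b(.*?)>(.*?)</pre\s*>}{$pre_segments[++$seg_id]=$2; "<pre data-seg=\"$seg_id\"$1></pre>";}gsex;

    # Crude stand-in for wpautop: a blank line starts a new paragraph.
    $munged_text =~ s/\n\n/\n<p>/g;

    my $atree = HTML::TreeBuilder->new();

    # Prepare to store comments. Requires wrapping in <html><body>
    $atree->store_comments(1);
    $atree->parse("<html><body>$munged_text</body></html>");
    $atree->eof();   # tell the parser the document is complete

    # Replace original text contents for <pre> elements
    foreach my $pre_element ($atree->look_down('_tag', 'pre')) {
        my $seg = $pre_element->attr('data-seg');
        next unless defined $seg;   # a <pre> the regex above did not catch
        $pre_element->push_content($pre_segments[$seg]);
        $pre_element->attr('data-seg', undef);
    }

    # Return the markup instead of printing it; the caller prints the result.
    my $html = $atree->as_HTML(undef, ' ', {});
    $atree->delete;   # free the parse tree
    return $html;
}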


# Here we show just the contents of the posts and pages.

my $posts = $ref->{channel}[0]->{item};

foreach my $post (@{$posts}) {
    # Group the alternatives: /^post|page$/ would also match e.g. "postcard".
    if ($post->{'wp:post_type'}[0] =~ /^(?:post|page)$/) {
        print "Entry with ID=$post->{'wp:post_id'}[0] of type $post->{'wp:post_type'}[0] contains:\n";
        print $post->{'content:encoded'}[0];
        print "\n-----------------\n";
        print rectify_html($post->{'content:encoded'}[0]);
        print "\n-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-\n";
    } else {
        # Show which other item types (attachments, custom post types) are skipped.
        print Dumper($post->{'wp:post_type'});
    }
}
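
As a further sketch, the post_parent and post_name fields described earlier can be combined to rebuild each page's path in the hierarchy. The assumptions here are that a post_parent of 0 (or blank) marks a top-level page and that the path is simply the chain of slugs. The helper names text_of and page_paths and the sample items are hypothetical, and the sample mimics $ref->{channel}[0]->{item} so the snippet runs on its own.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Reduce an XML::Simple leaf (string, empty hash, or hash with 'content')
# to a plain string.
sub text_of {
    my ($node) = @_;
    my $value = ref($node) eq 'ARRAY' ? $node->[0] : $node;
    return '' unless defined $value;
    return $value unless ref $value;
    return $value->{content} if ref($value) eq 'HASH' && defined $value->{content};
    return '';
}

# Map each page's post_id to a path built from its ancestors' slugs.
sub page_paths {
    my ($items) = @_;

    my %page;    # post_id => { slug, parent }
    for my $item (@{$items}) {
        next unless text_of($item->{'wp:post_type'}) eq 'page';
        $page{ text_of($item->{'wp:post_id'}) } = {
            slug   => text_of($item->{'wp:post_name'}),
            parent => text_of($item->{'wp:post_parent'}),
        };
    }

    my %path;
    for my $id (keys %page) {
        my (@parts, %seen);
        my $cur = $id;
        # Walk up until there is no parent ('0' and '' are false in Perl),
        # the parent is missing from the export, or a cycle is detected.
        while ($cur && $page{$cur} && !$seen{$cur}++) {
            unshift @parts, $page{$cur}{slug};
            $cur = $page{$cur}{parent};
        }
        $path{$id} = '/' . join('/', @parts) . '/';
    }
    return \%path;
}

# Hypothetical sample items, not taken from a real export.
my $items = [
    { 'wp:post_id' => ['10'], 'wp:post_type' => ['page'],
      'wp:post_name' => ['hotels'], 'wp:post_parent' => ['0'] },
    { 'wp:post_id' => ['11'], 'wp:post_type' => ['page'],
      'wp:post_name' => ['ma'], 'wp:post_parent' => ['10'] },
    { 'wp:post_id' => ['12'], 'wp:post_type' => ['page'],
      'wp:post_name' => ['springfield-ma'], 'wp:post_parent' => ['11'] },
    { 'wp:post_id' => ['20'], 'wp:post_type' => ['post'],
      'wp:post_name' => [ {} ], 'wp:post_parent' => ['0'] },
];

my $paths = page_paths($items);
print "$_ => $paths->{$_}\n" for sort { $a <=> $b } keys %{$paths};
```

An importer could use these paths as target URLs, which is where the unique-slug workaround (springfield-ma rather than springfield) becomes visible.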